Intro 2 Polars

Background

Ritchie Vink, Rust, Apache Arrow and Covid

Here is the story, by the creator of Polars.

Who Can Benefit from Polars?

  • Researcher (DS, Analyst, Statistician, etc):
    • Working on their local machine.
    • Working on a cloud machine (SageMaker, EC2).
  • Production system:
    • Running on a dedicated server.
    • Running on “serverless” (e.g. AWS Lambda, Google Cloud Functions).

The DataFrame Landscape

Initially there was R’s data.frame. R has since evolved, and now also offers tibbles and data.tables. For years, Python had only Pandas. Then the Python ecosystem exploded, and now we have:

  • Pandas: The original Python dataframe module. Built by Wes McKinney, on top of numpy.
  • Polars: A new dataframe module, built by Ritchie Vink, on top of Rust and Apache Arrow.
  • DuckDB: An in-process analytical SQL engine, with first-class dataframe interoperability.
  • ClickHouse chDB: An embedded, in-process version of the ClickHouse engine for Python.
  • Datafusion: A query engine written in Rust, built on Apache Arrow.
  • Databend: A cloud-native data warehouse written in Rust.
  • PyArrow: The Python bindings for Apache Arrow, with basic table and compute functionality.
  • Daft: A distributed dataframe library built for “Complex Data” (data that doesn’t usually fit in a SQL table such as images, videos, documents etc).
  • Fugue: A dataframe library that allows you to write SQL-like code, and execute it on different backends (e.g. Spark, Dask, Pandas, Polars, etc).
  • pySpark: The Python API for Spark. Spark is a distributed computing engine, with support for distributing data over multiple processes running Pandas (or numpy, Polars, etc).
  • CUDF: A GPU accelerated dataframe library, built on top of Apache Arrow.
  • datatable: An attempt to recreate R’s data.table API and (crazy) speed in Python.
  • Dask: A distributed computing engine for Python, with support for distributing data over multiple processes running Pandas (or numpy, Polars, etc).
  • Vaex: A high performance Python library for lazy Out-of-Core DataFrames (similar to dask, but with a different API).
  • Modin: A drop-in distributed replacement for Pandas, built on top of Ray.

For more details see here, here, here.

Motivation to Use Polars

Each of the following, alone(!), is amazing.

  1. Out of the box parallelism.
  2. Lazy Evaluation: With query planning and query optimization.
  3. Streaming engine: Can stream data from disk to memory for out-of-memory processing.
  4. A complete set of native dtypes; including missing values and strings.
  5. An intuitive and consistent API; inspired by PySpark.

Setting Up the Environment

At this point you may want to create and activate a venv for this project.

# %pip install -r requirements.txt
%pip show Polars # check your Polars version
Name: polars
Version: 0.20.15
Summary: Blazingly fast DataFrame library
Home-page: 
Author: 
Author-email: Ritchie Vink <ritchie46@gmail.com>
License: 
Location: /Users/johnros/workspace/polars_demo/.venv/lib/python3.9/site-packages
Requires: 
Required-by: 
Note: you may need to restart the kernel to use updated packages.
%pip show Pandas # check your Pandas version
Name: pandas
Version: 2.2.0
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: 
Author-email: The Pandas Development Team <pandas-dev@python.org>
License: BSD 3-Clause License
        
Location: /Users/johnros/workspace/polars_demo/.venv/lib/python3.9/site-packages
Requires: numpy, python-dateutil, pytz, tzdata
Required-by: bokeh, category-encoders, great-tables, holoviews, hvplot, mizani, panel, seaborn, statsmodels
Note: you may need to restart the kernel to use updated packages.
import polars as pl
pl.Config(fmt_str_lengths=50)
import polars.selectors as cs

import pandas as pd
import numpy as np
import pyarrow as pa
import plotly.express as px
import string
import random
import os
import sys
%matplotlib inline 
import matplotlib.pyplot as plt
from datetime import datetime

# Following two lines only required to view plotly when rendering from VScode. 
import plotly.io as pio
# pio.renderers.default = "plotly_mimetype+notebook_connected+notebook"
pio.renderers.default = "plotly_mimetype+notebook"

Which Polars version and optional dependencies are installed?

pl.show_versions()
--------Version info---------
Polars:               0.20.15
Index type:           UInt32
Platform:             macOS-14.3.1-arm64-arm-64bit
Python:               3.9.6 (default, Nov 10 2023, 13:38:27) 
[Clang 15.0.0 (clang-1500.1.0.2.5)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2023.9.0
gevent:               <not installed>
hvplot:               0.9.2
matplotlib:           3.6.3
numpy:                1.24.1
openpyxl:             <not installed>
pandas:               2.2.0
pyarrow:              15.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           1.4.46
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

How many cores are available for parallelism?

pl.thread_pool_size()
8

Memory Footprint

Memory Footprint of Storage

Comparing Polars to Pandas - the memory footprint of a series of strings.

Polars.

letters = pl.Series(list(string.ascii_letters))

n = int(10e6)
letter1 = letters.sample(n, with_replacement=True)

letter1.estimated_size(unit='gb') 
0.14901161193847656

Pandas (before Pandas 2.0.0).

# Pandas with numpy backend

letter1_Pandas = pd.Series(list(string.ascii_letters)).sample(n, replace=True)
# Alternatively: letter1_Pandas = letter1.to_pandas(use_pyarrow_extension_array=False) 

letter1_Pandas.memory_usage(deep=True, index=True) / 1e9
0.66

Pandas after Pandas 2.0, with the Pyarrow backend (Apr 2023).

letter1_Pandas = pd.Series(list(string.ascii_letters), dtype="string[pyarrow]").sample(n, replace=True)
# Alternatively: letter1_Pandas = letter1.to_pandas(use_pyarrow_extension_array=True) 

letter1_Pandas.memory_usage(deep=True, index=True) / 1e9
0.17125

Lazy Frames and Query Planning

Consider a filter operation that follows a sort operation. Ideally, the filter would precede the sort, but suppose we did not ensure this… We now demonstrate that Polars’ query planner will reorder the operations for you. En passant, we see that Polars is also more efficient even without the query planner.

Polars’ eager evaluation in the wrong order. Sort then filter.

%timeit -n 2 -r 2 letter1.sort().filter(letter1.is_in(['a','b','c']))
467 ms ± 18.9 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)

Polars’ Eager evaluation in the right order. Filter then sort.

%timeit -n 2 -r 2 letter1.filter(letter1.is_in(['a','b','c'])).sort()
73.1 ms ± 350 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)

Now prepare a Polars LazyFrame required for query optimization.

letter1_lazy = letter1.alias('letters').to_frame().lazy()

Polars’ Lazy evaluation in the wrong order; without query planning

%timeit -n 2 -r 2 letter1_lazy.sort(by='letters').filter(pl.col('letters').is_in(['a','b','c'])).collect(no_optimization=True)
424 ms ± 1.82 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)

Polars’ Lazy evaluation in the wrong order; with query planning

%timeit -n 2 -r 2 letter1_lazy.sort(by='letters').filter(pl.col('letters').is_in(['a','b','c'])).collect()
75.6 ms ± 6.3 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)

Things to note:

  1. Lazy evaluation was enabled when .lazy() converted the Polars DataFrame to a Polars LazyFrame.
  2. The query planner worked: the lazy evaluation in the wrong order took about the same time as an eager evaluation in the right order, even when accounting for the overhead of converting the frame from eager to lazy.

Here is the actual query plan of each. The non-optimized version:

letter1_lazy.sort(by='letters').filter(pl.col('letters').is_in(['a','b','c'])).show_graph(optimized=False)

The optimized version:

letter1_lazy.sort(by='letters').filter(pl.col('letters').is_in(['a','b','c'])).show_graph(optimized=True)

Now compare to Pandas…

Pandas’ eager evaluation in the wrong order.

%timeit -n1 -r1 letter1_Pandas.sort_values().loc[lambda x: x.isin(['a','b','c'])]
2.19 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Pandas eager evaluation in the right order: Filter then sort.

%timeit -n1 -r1 letter1_Pandas.loc[lambda x: x.isin(['a','b','c'])].sort_values()
135 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Pandas without lambda syntax.

%timeit -n 2 -r 2 letter1_Pandas.loc[letter1_Pandas.isin(['a','b','c'])].sort_values()
135 ms ± 1.24 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)

Things to note:

  1. Query planning works!
  2. Pandas has dramatically improved since version 2.0.0.
  3. Lambda functions can be slow (in both Pandas and Polars), though here the lambda syntax added no measurable overhead.

For a full list of the operations that are optimized by Polars’ query planner see here.

Optimized for Within-Column Operations

Polars seamlessly parallelizes across columns (and within a column, when possible). As the number of columns in the data grows, we would expect roughly constant runtime until all cores are in use, and linear scaling thereafter. The following code demonstrates this idea, using a simple within-column sum.

# Mac users with Apple silicon (M1 or M2) may also want to benchmark Apple's MLX:
# %pip install mlx
import mlx.core as mx
# Make an array of floats.
A_numpy = np.random.randn(int(1e6), 10)

A_Polars = pl.DataFrame(A_numpy)
A_Pandas_numpy = pd.DataFrame(A_numpy)
A_Pandas_arrow = pd.DataFrame(A_numpy, dtype="float32[pyarrow]")
# A_arrow = pa.Table.from_pandas(A_Pandas_numpy) # no sum method
A_mlx = mx.array(A_numpy)

Candidates currently omitted:

  1. JAX
  2. PyTorch
  3. TensorFlow
  4. …?

Summing Over Columns

%timeit -n 4 -r 2 A_numpy.sum(axis=0)
8.14 ms ± 21.4 µs per loop (mean ± std. dev. of 2 runs, 4 loops each)
A_numpy.sum(axis=0).shape
(10,)
%timeit -n 4 -r 2 A_Polars.sum()
4.16 ms ± 2.48 ms per loop (mean ± std. dev. of 2 runs, 4 loops each)
A_Polars.sum().shape
(1, 10)
%timeit -n 4 -r 2 A_mlx.sum(axis=0)
The slowest run took 163.03 times longer than the fastest. This could mean that an intermediate result is being cached.
171 µs ± 169 µs per loop (mean ± std. dev. of 2 runs, 4 loops each)
A_mlx.sum(axis=0).shape
(10,)

Caveat: MLX evaluates lazily, so these timings mostly measure building the compute graph; wrap the call in mx.eval() for a fair comparison.

50 Shades of Pandas

Pandas with numpy backend

%timeit -n 4 -r 2 A_Pandas_numpy.sum(axis=0)
25.6 ms ± 433 µs per loop (mean ± std. dev. of 2 runs, 4 loops each)
A_Pandas_numpy.sum(axis=0).shape
(10,)

Pandas with arrow backend

%timeit -n 4 -r 2 A_Pandas_arrow.sum(axis=0)
3.74 ms ± 202 µs per loop (mean ± std. dev. of 2 runs, 4 loops each)
A_Pandas_arrow.sum(axis=0).shape
(10,)

Pandas with numpy backend, converted to numpy

%timeit -n 4 -r 2 A_Pandas_numpy.values.sum(axis=0)
11.7 ms ± 2.66 ms per loop (mean ± std. dev. of 2 runs, 4 loops each)
A_Pandas_numpy.values.sum(axis=0).shape
(10,)

Pandas with arrow backend, converted to numpy

%timeit -n 4 -r 2 A_Pandas_arrow.values.sum(axis=0)
403 ms ± 10.6 ms per loop (mean ± std. dev. of 2 runs, 4 loops each)
type(A_Pandas_arrow.values)
numpy.ndarray
A_Pandas_arrow.values.sum(axis=0).shape
(10,)

Pandas to mlx

%timeit -n 4 -r 2 mx.array(A_Pandas_numpy.values).sum(axis=0)
4.8 ms ± 2.3 ms per loop (mean ± std. dev. of 2 runs, 4 loops each)
mx.array(A_Pandas_numpy.values).sum(axis=0).shape
(10,)

Summing Over Rows

%timeit -n 4 -r 2 A_numpy.sum(axis=1)
8.08 ms ± 160 µs per loop (mean ± std. dev. of 2 runs, 4 loops each)
A_numpy.sum(axis=1).shape
(1000000,)
%timeit -n 4 -r 2 A_Polars.sum_horizontal()
7.85 ms ± 2.83 ms per loop (mean ± std. dev. of 2 runs, 4 loops each)
A_Polars.sum_horizontal().shape
(1000000,)
%timeit -n 4 -r 2 A_mlx.sum(axis=1)
The slowest run took 5.35 times longer than the fastest. This could mean that an intermediate result is being cached.
6.35 µs ± 4.35 µs per loop (mean ± std. dev. of 2 runs, 4 loops each)
A_mlx.sum(axis=1).shape
(1000000,)

50 Shades of Pandas

Pandas with numpy backend

%timeit -n 4 -r 2 A_Pandas_numpy.sum(axis=1)
70.5 ms ± 1.35 ms per loop (mean ± std. dev. of 2 runs, 4 loops each)

Pandas with arrow backend

%timeit -n 4 -r 2 A_Pandas_arrow.sum(axis=1)
188 ms ± 8.93 ms per loop (mean ± std. dev. of 2 runs, 4 loops each)

Pandas with numpy backend, converted to numpy

%timeit -n 4 -r 2 A_Pandas_numpy.values.sum(axis=1)
8.05 ms ± 99.5 µs per loop (mean ± std. dev. of 2 runs, 4 loops each)

Pandas with arrow backend, converted to numpy

%timeit -n 4 -r 2 A_Pandas_arrow.values.sum(axis=1)
417 ms ± 18.3 ms per loop (mean ± std. dev. of 2 runs, 4 loops each)

Pandas to mlx

%timeit -n 4 -r 2 mx.array(A_Pandas_numpy.values).sum(axis=1)
5.25 ms ± 2.56 ms per loop (mean ± std. dev. of 2 runs, 4 loops each)

Speed Of Import

Polars’ pl.read_* functions are considerably faster than their Pandas counterparts, due to parallelism and better type-inference heuristics.

We benchmark by making synthetic data, saving it to disk, and re-importing it.

CSV Format

n_rows = int(1e5)
n_cols = 10
data_Polars = pl.DataFrame(np.random.randn(n_rows,n_cols))
data_Polars.write_csv('data/data.csv', include_header = False)
f"{os.path.getsize('data/data.csv')/1e6:.1f} MB on disk"
'19.6 MB on disk'

Import with Pandas.

%timeit -n2 -r2 data_Pandas = pd.read_csv('data/data.csv', header = None)
97.9 ms ± 2.66 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)

Import with Polars.

%timeit -n2 -r2 data_Polars = pl.read_csv('data/data.csv', has_header = False)
10.3 ms ± 2.05 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)

What is the ratio of times on your machine? How many cores do you have?

Parquet Format

data_Polars.write_parquet('data/data.parquet')
f"{os.path.getsize('data/data.parquet')/1e6:.1f} MB on disk"
'7.8 MB on disk'
%timeit -n2 -r2 data_Pandas = pd.read_parquet('data/data.parquet')
The slowest run took 6.02 times longer than the fastest. This could mean that an intermediate result is being cached.
21.7 ms ± 15.5 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
%timeit -n2 -r2 data_Polars = pl.read_parquet('data/data.parquet')
5.88 ms ± 1.96 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)

Feather (Apache IPC) Format

data_Polars.write_ipc('data/data.feather')
f"{os.path.getsize('data/data.feather')/1e6:.1f} MB on disk"
'8.0 MB on disk'
%timeit -n2 -r2 data_Polars = pl.read_ipc('data/data.feather')
The slowest run took 16.90 times longer than the fastest. This could mean that an intermediate result is being cached.
454 µs ± 404 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
%timeit -n2 -r2 data_Pandas = pd.read_feather('data/data.feather')
4.37 ms ± 736 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)

Pickle Format

import pickle
pickle.dump(data_Polars, open('data/data.pickle', 'wb'))
f"{os.path.getsize('data/data.pickle')/1e6:.1f} MB on disk"
'8.0 MB on disk'
%timeit -n2 -r2 data_Polars = pickle.load(open('data/data.pickle', 'rb'))
2.99 ms ± 552 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)

Summarizing Import

Things to note:

  1. The difference in speed between Pandas and Polars is quite large.
  2. When dealing with CSVs, the function pl.read_csv reads in parallel, and has better type-guessing heuristics.
  3. The difference in speed between formats is also quite large, with feather < parquet < csv.
  4. Feather is the fastest to read, but larger on disk. It is thus good for short-term storage, with parquet better for long-term storage.
  5. The fact that pickle isn’t the fastest surprised me; but then again, it is not optimized for data.

Speed Of Join

Because Pandas is built on numpy, people see it as both an in-memory database and a matrix/array library. With Polars, it is quite clear that it is an in-memory database, not an array-processing library (despite having a .dot() method for inner products). As such, you cannot multiply two Polars dataframes, but you can certainly join them efficiently.

Make some data:

def make_data(n_rows, n_cols):
  data = np.concatenate(
    (
      np.arange(n_rows)[:, np.newaxis],  # index
      np.random.randn(n_rows, n_cols),   # values
    ),
    axis=1,
  )
  return data


n_rows = int(1e7)
n_cols = 10
data_left = make_data(n_rows, n_cols)
data_right = make_data(n_rows, n_cols)

data_left.shape
(10000000, 11)

Polars Join

data_left_Polars = pl.DataFrame(data_left)
data_right_Polars = pl.DataFrame(data_right)

%timeit -n2 -r2 Polars_joined = data_left_Polars.join(data_right_Polars, on = 'column_0', how = 'left')
783 ms ± 189 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)

Pandas Join

data_left_Pandas = pd.DataFrame(data_left)
data_right_Pandas = pd.DataFrame(data_right)

%timeit -n2 -r2 Pandas_joined = data_left_Pandas.merge(data_right_Pandas, on = 0, how = 'left')
2.46 s ± 475 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)

The NYC Taxi Dataset

Empirical demonstration: Load the celebrated NYC taxi dataset, filter some rides and get the mean tip_amount by passenger_count.

path = 'data/NYC' # Data from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
file_names = os.listdir(path)

Pandas

df.query() syntax.

%%time 
taxi_Pandas = pd.read_parquet(path)

query = '''
    passenger_count > 0 and 
    passenger_count < 5 and  
    trip_distance >= 0 and 
    trip_distance <= 10 and 
    fare_amount >= 0 and 
    fare_amount <= 100 and 
    tip_amount >= 0 and 
    tip_amount <= 20 and 
    total_amount >= 0 and 
    total_amount <= 100
    '''.replace('\n', '')
taxi_Pandas.query(query).groupby('passenger_count').agg({'tip_amount':'mean'})
CPU times: user 1.79 s, sys: 1.46 s, total: 3.25 s
Wall time: 3.44 s
tip_amount
passenger_count
1.0 2.843313
2.0 2.844228
3.0 2.777564
4.0 2.648380

Well, the df.loc[] syntax is usually faster than the query syntax:

%%time 
taxi_Pandas = pd.read_parquet(path)

ind = (
    taxi_Pandas['passenger_count'].between(1,4) 
    & taxi_Pandas['trip_distance'].between(0,10) 
    & taxi_Pandas['fare_amount'].between(0,100) 
    & taxi_Pandas['tip_amount'].between(0,20) 
    & taxi_Pandas['total_amount'].between(0,100)
)
(
    taxi_Pandas[ind]
    .groupby('passenger_count')
    .agg({'tip_amount':'mean'})
)
CPU times: user 1.8 s, sys: 1.59 s, total: 3.38 s
Wall time: 2.22 s
tip_amount
passenger_count
1.0 2.843313
2.0 2.844228
3.0 2.777564
4.0 2.648380

Polars Lazy In Memory

%%time 

import pyarrow.dataset as ds

q = (
    
    pl.scan_pyarrow_dataset(
      ds.dataset("data/NYC", format="parquet") # Using PyArrow's Parquet reader
    )
    
    # pl.scan_parquet("data/NYC/*.parquet") # will not work because parquet was created with Int32, and not Int64. 
    
    .filter(
        pl.col('passenger_count').is_between(1,4),
        pl.col('trip_distance').is_between(0,10),
        pl.col('fare_amount').is_between(0,100),
        pl.col('tip_amount').is_between(0,20),
        pl.col('total_amount').is_between(0,100)
    )
    .group_by(
      pl.col('passenger_count')
      )
    .agg(
      pl.col('tip_amount').mean().name.suffix('_mean')
      ) 
    )

q.collect()
CPU times: user 499 ms, sys: 310 ms, total: 809 ms
Wall time: 1.21 s
shape: (4, 2)
passenger_count tip_amount_mean
f64 f64
1.0 2.843313
2.0 2.844228
4.0 2.64838
3.0 2.777564
q.show_graph(optimized=True) # Graphviz has to be installed

Things to note:

  1. I did not use the native pl.scan_parquet(), even though it is the recommended reader. For your purposes, you will almost always use the native readers. It is convenient to remember, however, that you can also fall back on the PyArrow importers if the native importers fail.
  2. I only have 2 parquet files. When I run the same with more files, despite my 16GB of RAM, Pandas will crash my Python kernel.

Polars Lazy From Disk

Important

The following shows how to use the Polars streaming engine. This is arguably the biggest difference from Pandas and other in-memory dataframe libraries.

q.collect(streaming=True)
shape: (4, 2)
passenger_count tip_amount_mean
f64 f64
3.0 2.777564
4.0 2.64838
2.0 2.844228
1.0 2.843313

Enough with motivation. Let’s learn something!

Preliminaries

Object Classes

  • Polars Series: Like a Pandas series. An in-memory array of data, with a name, and a dtype.

  • Polars DataFrame: A collection of Polars Series. This is the Polars equivalent of a Pandas DataFrame. It is eager, and does not allow query planning.

  • Polars Expr: A recipe for computing a series. A Polars expression can be thought of as:

    1. A Lazy Series: A series that is not yet computed, and that will be computed when needed.
    2. A function: That maps a Polars expression to another Polars expression, and can thus be chained.
  • Polars LazyFrame: A collection of Polars Expressions. This is the Polars equivalent of a Spark DataFrame. It is lazy, thus allows query planning.

Warning

Not all methods are implemented for all classes. In particular, not all pl.Dataframe methods are implemented for pl.LazyFrame and vice versa. The same goes for pl.Series and pl.Expr.

This is not because the developers are lazy, but because the API is still being developed, and there are fundamental differences between the classes.

Think about it:

  1. Why do we not see a df.height attribute for a pl.LazyFrame?
  2. Why do we not see a df.sample() method for a pl.LazyFrame?

Evaluation Engines

Polars has (seemingly) 2 evaluation engines:

  1. Eager: This is the default, and it behaves like Pandas. When you call an expression, it is immediately evaluated.
  2. Lazy: This behaves like Spark. When you call an expression, it is added to a chain of expressions which makes up a query plan. The query plan is optimized and evaluated when you call .collect().

Why “seemingly” 2? Because each engine has its own subtleties. For instance, the behavior of the lazy engine may depend on streaming vs. non-streaming evaluation, and on the means of loading the data.

  1. Streaming or Not?: This is a special case of lazy evaluation, used when your data does not fit in RAM. You can then call .collect(streaming=True) to process the dataset in chunks.
  2. Native Loaders or Not?: Reading multiple parquet files using Polars’ native readers may behave slightly differently than reading the same files as a PyArrow dataset (always prefer the native readers, when possible).

Polars dtypes

Polars has its own dtypes, called with pl.<dtype>; e.g. pl.Int32. A comprehensive list may be found here.

Here are the most common. Note that, unlike Pandas (<2.0.0), all are native Polars dtypes, and do not fall back on Python objects.

  • Floats:
    • pl.Float64: As the name suggests.
  • Integers:
    • pl.Int64: As the name suggests.
  • Booleans:
    • pl.Boolean: As the name suggests.
  • Strings:
    • pl.Utf8: The only string encoding supported by Polars.
    • pl.String: Recently introduced as an alias to pl.Utf8.
    • pl.Categorical: A string that is encoded as an integer.
    • pl.Enum: Short for “enumerate”. A categorical with a fixed set of values.
  • Temporal:
    • pl.Date: Date, without time.
    • pl.Datetime: Date, with time.
    • pl.Time: Time, without date.
    • pl.Duration: Time difference.
  • Nulls:
    • pl.Null: Polars equivalent of Python’s None.
    • np.nan: The numpy dtype. Essentially a float, not a null.
  • Nested:
    • pl.List: A list of items.
    • pl.Array: A fixed length list.
    • pl.Struct: Think of it as a dict within a frame.

Things to note:

  • Polars has no pl.Int dtype, nor pl.Float. You must specify the number of bits.
  • Polars also supports np.nan(!), which is different from its pl.Null dtype. np.nan is a float, and Null is like Python’s None.

The Polars API

  • You will fall in love with it!
  • Much more similar to PySpark than to Pandas. The Pandas API is simply not amenable to lazy evaluation. If you are familiar with PySpark, you should feel at home pretty fast.

Some Design Principles

Here are some principles that will help you understand the API:

  1. All columns are created equal. There are no indexing columns.

  2. Operations on the columns of a dataframe will always be part of a context. Context may include:

    • pl.DataFrame.select(): This is the most common context. Just like a SQL SELECT, it is used to select and transform columns.
    • pl.DataFrame.with_columns(): Transform columns but return all columns in the frame; not just the transformed ones.
    • pl.DataFrame.group_by().agg(): The .agg() context works like a .select() context, but it is used to apply operations on sub-groups of rows.
    • pl.DataFrame.filter(): This is used to filter rows using expressions that evaluate to Booleans.
    • pl.SQLContext().execute(): This is used if you prefer to use SQL syntax, instead of the Polars API.
  3. Nothing happens “in-place”.

  4. Two-word methods are always lower-case, and separated by underscores. E.g: .is_in() instead of .isin(); .is_null() instead of .isnull(); .group_by() instead of .groupby() (since version 0.19.0).

  5. Look for pl.Expr() methods so you can chain operations. E.g. pl.col('a').add(pl.col('b')) is better than pl.col('a') + pl.col('b'); the former can be further chained. And there is always .pipe().

  6. Polars was designed for operation within columns, not within rows. Operations within rows are possible via:

    • Polars functions with a _horizontal() suffix. Examples: pl.sum_horizontal(), pl.mean_horizontal(), pl.min_horizontal().
    • Combining columns into a single column with nested dtype. Examples: pl.list(), pl.array(), pl.struct().
  7. Always remember the class you are operating on. Series, Expressions, DataFrames, and LazyFrames, have similar but-not-identical methods.

Some Examples of the API

Here is an example to give you a little taste of what the API feels like.

# Make some data
polars_frame = pl.DataFrame(make_data(100,4))
polars_frame.limit(5)
shape: (5, 5)
column_0 column_1 column_2 column_3 column_4
f64 f64 f64 f64 f64
0.0 -0.024173 0.331492 -0.723544 1.618408
1.0 -1.651808 -1.045469 -0.54715 -1.560864
2.0 0.567596 -0.337249 0.681728 -0.805223
3.0 -0.079549 1.553233 2.72464 2.3072
4.0 0.866269 -0.348577 -0.219935 -0.142849
Note

What is the difference between .head() and limit()? For eager frames? For lazy frames?

Can you parse the following in your head?

(
  polars_frame
  .rename({'column_0':'group'})
  .with_columns(
    pl.col('group').cast(pl.Int32),
    pl.col('column_1').ge(0).alias('non-negative'),
  )
  .group_by('non-negative')
  .agg(
    pl.col('group').is_between(1,4).sum().alias('one-to-four'),
    pl.col('^column_[0-9]$').mean().name.suffix('_mean'),
  )
)
shape: (2, 6)
non-negative one-to-four column_1_mean column_2_mean column_3_mean column_4_mean
bool u32 f64 f64 f64 f64
true 2 0.72212 0.152786 0.015856 -0.182337
false 2 -0.806728 0.098868 0.178319 0.230414

Ask yourself:

  • What is polars_frame? Is it an eager or a lazy Polars dataframe?
  • Why is column_1_mean indeed negative when non-negative=false?
  • What is a Polars expression?
  • What is a Polars series?
  • How did I create the columns column_1_mean through column_4_mean?
  • How would you have written this in Pandas?
(
  polars_frame
  .rename({'column_0':'group'})
  .select(
    pl.col('group').mod(2),

    pl.mean_horizontal(
      pl.col('^column_[0-9]$')
    )
    .name.suffix('_mean')
  )
  .filter(
    pl.col('group').eq(1),
    pl.col('column_1_mean').gt(0)
  )
)
shape: (25, 2)
group column_1_mean
f64 f64
1.0 1.626381
1.0 0.245842
1.0 0.418367
1.0 0.124992
1.0 0.366748
1.0 0.459953
1.0 0.496322
1.0 0.139168
1.0 0.984494
1.0 0.721107

Try parsing the following in your head:

polars_frame_2 = (
  pl.DataFrame(make_data(100,1))
  .select(
    pl.col('*').name.suffix('_second')
  )
)



(
  polars_frame
  .join(
    polars_frame_2,
    left_on = 'column_0',
    right_on = 'column_0_second',
    how = 'left',
    validate='1:1'
  )
)
shape: (100, 6)
column_0 column_1 column_2 column_3 column_4 column_1_second
f64 f64 f64 f64 f64 f64
0.0 -0.024173 0.331492 -0.723544 1.618408 1.88575
1.0 -1.651808 -1.045469 -0.54715 -1.560864 -1.05814
2.0 0.567596 -0.337249 0.681728 -0.805223 -0.525377
3.0 -0.079549 1.553233 2.72464 2.3072 -0.198516
4.0 0.866269 -0.348577 -0.219935 -0.142849 2.145621
95.0 0.882176 0.34941 -0.893759 -0.766842 -0.14159
96.0 -0.65607 0.436874 -0.647228 -0.624721 -1.078866
97.0 0.762848 1.270774 0.118442 0.732365 -0.424409
98.0 0.935173 0.721465 -0.503583 0.111514 1.872621
99.0 -1.041987 -0.38698 0.581323 -1.577651 0.009217

Can you parse the following in your head?

polars_frame_3 = pl.DataFrame(make_data(10,1))

(
  polars_frame.lazy()
  .select(
    pl.col('*').name.suffix('_first')
  )
  .with_context(
    polars_frame_3.lazy()
    .select(
      pl.col('*').name.suffix('_third')
    ) 
  )
  .select(
    'column_0_first',

    pl.when(
      pl.col('column_0_first').mod(2).eq(0)
      )
    .then(
      pl.lit(1)
      )
    .otherwise(
      polars_frame_3
      .select(
        pl.last().min()
        )
      )
  )
  .collect(streaming=True)
)
shape: (100, 2)
column_0_first literal
f64 f64
0.0 1.0
1.0 -1.369263
2.0 1.0
3.0 -1.369263
4.0 1.0
95.0 -1.369263
96.0 1.0
97.0 -1.369263
98.0 1.0
99.0 -1.369263

Getting Help

Before we dive in, you should be aware of the following references for further help:

  1. A github page. It is particularly important to subscribe to release updates.
  2. A user guide.
  3. A very active community on Discord.
  4. The API reference.
  5. A Stack-Overflow tag.
  6. Cheat-sheet for Pandas users.

Warning: Be careful with AI assistants such as GitHub Copilot, TabNine, etc. Polars is still very new, and they may give you Pandas completions instead of Polars.

Polars Series

A Polars series looks and feels a lot like a Pandas series. Beware: getting used to Polars Series will thus give you misleading intuitions when you move on to Polars Expressions.

Construct a series

s = pl.Series("a", [1, 2, 3])
s
shape: (3,)
a
i64
1
2
3

Make Pandas series for later comparisons.

s_Pandas = pd.Series([1, 2, 3], name = "a")
s_Pandas
0    1
1    2
2    3
Name: a, dtype: int64

Notice even the printing to notebooks is different.

Now verify the type

type(s)
polars.series.series.Series
type(s_Pandas)
pandas.core.series.Series
s.dtype
Int64
s_Pandas.dtype
dtype('int64')

Renaming a series; this will be very useful when operating on dataframe columns.

s.alias("b")
shape: (3,)
b
i64
1
2
3

Constructing a series of floats, for later use.

f = pl.Series("a", [1., 2., 3.])
f
shape: (3,)
a
f64
1.0
2.0
3.0
f.dtype
Float64

Export To Other Python Objects

The current section deals with exports to other python objects, in memory. See Section 6.2 for exporting to disk.

Export to Polars DataFrame.

s.to_frame() 
shape: (3, 1)
a
i64
1
2
3

Export to Python list.

s.to_list()
[1, 2, 3]

Export to Numpy array.

s.to_numpy() # useful for preparing data for learning with scikit-learn
array([1, 2, 3])

Export to Pandas Series with Numpy backend.

s.to_pandas()
0    1
1    2
2    3
Name: a, dtype: int64

Export to Pandas Series with Arrow backend.

s.to_pandas(use_pyarrow_extension_array=True)
0    1
1    2
2    3
Name: a, dtype: int64[pyarrow]

Export to Arrow Array.

s.to_arrow() 
<pyarrow.lib.Int64Array object at 0x29b27c4c0>
[
  1,
  2,
  3
]

Scikit-learn does not currently support Arrow, but may actually support Polars soon enough. XGBoost, however, does support Arrow.

Memory Representation of Series

Object size in memory. Super useful for profiling. Will only be available for eager objects, by definition.

s.estimated_size(unit="b") # 8(bytes) * 3(length)
24

Shrink the memory allocation to the size of the actual data (in place).

s.shrink_to_fit() 
shape: (3,)
a
i64
1
2
3

Filtering and Subsetting

Filter

s[0] # same as s.__getitem__(0)
1

To filter, you need to use the .filter() method, which can accept a list or a Polars series (I did not try other iterables).

s.filter([True, False, True])
shape: (2,)
a
i64
1
3
s.filter(pl.Series("a", [True, False, True])) # works
shape: (2,)
a
i64
1
3

You can filter with any expression that evaluates to a boolean list/series.

s.filter(s.ge(2))
shape: (2,)
a
i64
2
3

Filtering with the [ operator will not work:

s[[True, False, True]]

Subset Using Location

s.limit(2)
shape: (2,)
a
i64
1
2
s.head(2)
shape: (2,)
a
i64
1
2
s.tail(2)
shape: (2,)
a
i64
2
3
s.gather([0, 2]) # same as Pandas .iloc[[0, 2]]
shape: (2,)
a
i64
1
3

Note: Unlike the [ operator, .gather() is also a pl.Expr() method, so will work in a lazy frame, and it may be chained.

s.slice(1, 2) # same as Pandas .iloc[1:3] (length 2, starting at index 1)
shape: (2,)
a
i64
2
3
s.gather_every(2) # same as Pandas .iloc[::2]
shape: (2,)
a
i64
1
3

Aggregations

Statistical Aggregations

s.sum()
6
s.min()
1
s.arg_min()
0
s.max()
3
s.arg_max()
2
s.peak_max()
shape: (3,)
a
bool
false
false
true
s.mean()
2.0
s.median()
2.0
s.quantile(0.2)
1.0
s.entropy()
-4.68213122712422
s.describe() 
shape: (9, 2)
statistic value
str f64
"count" 3.0
"null_count" 0.0
"mean" 2.0
"std" 1.0
"min" 1.0
"25%" 2.0
"50%" 2.0
"75%" 3.0
"max" 3.0

Polars pl.series.describe() is almost the same as Pandas pd.series.describe().

s_Pandas.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: a, dtype: float64
s.value_counts()
shape: (3, 2)
a count
i64 u32
2 1
1 1
3 1

Logical Aggregations

b = pl.Series("a", [True, True, False])
b.dtype
Boolean
b.all()
False
b.any()
True
b.not_()
shape: (3,)
a
bool
false
false
true

Missing

Thanks to Arrow, Polars has built in missing value support for all(!) dtypes. This has been a long awaited feature in the Python data science ecosystem with implications on speed, memory, style and more. The Polars User Guide has a great overview of the topic from which we collect some take-homes:

  • np.nan is also supported alongside pl.Null, but it is not considered a missing value by Polars. This has implications for null counts, statistical aggregations, etc.
  • pl.Null and np.nan each have their own separate functions for imputing, counting, etc.
m = pl.Series("a", [1, 2, None, np.nan])
m.is_null() # checking for None's. Like Pandas .isna()
shape: (4,)
a
bool
false
false
true
false
m.is_nan() # checking for np.nan's
shape: (4,)
a
bool
false
false
null
true

For comparison with Pandas:

m_Pandas = pd.Series([1, 2, None, np.nan])
m_Pandas.isna()
0    False
1    False
2     True
3     True
dtype: bool
m_Pandas.isnull() # alias for pd.isna()
0    False
1    False
2     True
3     True
dtype: bool

Operating on Missing

We now compare the behavior of Polars to Pandas when operating on missing values. When interpreting the following remember:

  1. For Polars, nan is not missing. It is some unknown number.
  2. For Pandas, nan and Nulls are the same.
# Polars
m1 = pl.Series("a", [1, None, 2, ]) # python native None
m2 = pl.Series("a", [1, np.nan, 2, ]) # numpy's nan
m3 = pl.Series("a", [1, float('nan'), 2, ]) # python's nan

# Pandas
m4 = pd.Series([1, None, 2 ])
m5 = pd.Series([1, np.nan, 2, ])
m6 = pd.Series([1, float('nan'), 2, ])
[
  m1.sum(), 
  m2.sum(), 
  m3.sum(), 
  m4.sum(), 
  m5.sum(), 
  m6.sum(),
  ]
[3, nan, nan, 3.0, 3.0, 3.0]

Things to note:

  • None will be ignored by both, which is unsafe.
  • np.nan will be ignored by Pandas (unsafe), but not by Polars (safe).

Filling missing values; None and np.nan are treated differently:

m1.fill_null(0)
shape: (3,)
a
i64
1
0
2
m2.fill_null(0)
shape: (3,)
a
f64
1.0
NaN
2.0
m3.fill_nan(0)
shape: (3,)
a
f64
1.0
0.0
2.0
m1.drop_nulls()
shape: (2,)
a
i64
1
2
m1.drop_nans()
shape: (3,)
a
i64
1
null
2
m2.drop_nulls()
shape: (3,)
a
f64
1.0
NaN
2.0
m1.interpolate()
shape: (3,)
a
f64
1.0
1.5
2.0
m2.interpolate() # np.nan is not considered missing, so why interpolate?
shape: (3,)
a
f64
1.0
NaN
2.0

Shape Transformations

s.to_dummies(drop_first=True)
shape: (3, 2)
a_2 a_3
u8 u8
0 0
1 0
0 1
s.shift(1, fill_value=999)
shape: (3,)
a
i64
999
1
2
s.shift(-1)
shape: (3,)
a
i64
2
3
null
pl.Series("a",[1,2,3,4]).reshape((2,2))
shape: (2,)
a
list[i64]
[1, 2]
[3, 4]

Arithmetic Operations

This section shows arithmetic operations on Polars series, which are not chainable. For chainable arithmetic, see Section 5.7.1.

s+1
shape: (3,)
a
i64
2
3
4
s-1
shape: (3,)
a
i64
0
1
2
s*2
shape: (3,)
a
i64
2
4
6
s/2
shape: (3,)
a
f64
0.5
1.0
1.5
s//2
shape: (3,)
a
i64
0
1
1
s%2
shape: (3,)
a
i64
1
0
1

Mathematical Transformations

s.abs()
shape: (3,)
a
i64
1
2
3
s.sin()
shape: (3,)
a
f64
0.841471
0.909297
0.14112
s.exp()
shape: (3,)
a
f64
2.718282
7.389056
20.085537
s.hash()
shape: (3,)
a
u64
4765690112321800547
13496407243000087834
3120322361086630706

But see here for a Polars extension for hashing.

s.log()
shape: (3,)
a
f64
0.0
0.693147
1.098612
s.sqrt()
shape: (3,)
a
f64
1.0
1.414214
1.732051

Comparisons

Compare objects

s.equals(pl.Series("a", [1, 2, 3]))
True
s.equals([1,2,3])
AttributeError: 'list' object has no attribute '_s'

Compare entries in an object

s.eq(2)
shape: (3,)
a
bool
false
true
false
s.eq([1,2,3]) # same as s==[1,2,3]
shape: (3,)
a
bool
true
true
true

I will always prefer the chainable version of a comparison.

s.ge(2) # same as s >= 2
shape: (3,)
a
bool
false
true
true

Clip, aka Winsorize.

f.clip(lower_bound=1.5,upper_bound=2.5)
shape: (3,)
a
f64
1.5
2.0
2.5
f.round(2)
shape: (3,)
a
f64
1.0
2.0
3.0
f.ceil()
shape: (3,)
a
f64
1.0
2.0
3.0
f.floor()
shape: (3,)
a
f64
1.0
2.0
3.0

Apply (map_elements)

Applying your own function. Also note the informative warning message (introduced in Polars 0.18.0) that tries to recommend a more efficient way to do things.

s.map_elements(lambda x: x + 1)
/var/folders/ld/j2ckdhnj5hqcx58byy0gj9th0000gn/T/ipykernel_85769/2438314803.py:1: PolarsInefficientMapWarning:


Series.map_elements is significantly slower than the native series API.
Only use if you absolutely CANNOT implement your logic otherwise.
Replace this expression...
  - s.map_elements(lambda x: ...)
with this one instead:
  + s + 1

shape: (3,)
a
i64
2
3
4

Are lambda functions really so much slower?

s1 = pl.Series(np.random.randn(int(1e6)))

Adding 1 with apply:

%timeit -n2 -r2 s1.map_elements(lambda x: x + 1)
142 ms ± 3.53 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)

Adding 1 without apply:

%timeit -n2 -r2 s1+1
828 µs ± 84.1 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)

Cumulative Operations

s.cum_max()
shape: (3,)
a
i64
1
2
3
s.cum_sum()
shape: (3,)
a
i64
1
3
6
s.cum_prod()
shape: (3,)
a
i64
1
2
6
s.ewm_mean(com=0.5)
/var/folders/ld/j2ckdhnj5hqcx58byy0gj9th0000gn/T/ipykernel_85769/799281854.py:1: DeprecationWarning:

The default value for `ignore_nulls` for `ewm` methods will change from True to False in the next breaking release. Explicitly set `ignore_nulls=True` to keep the existing behavior and silence this warning.
shape: (3,)
a
f64
1.0
1.75
2.615385

Differentiation Operations

s.diff()
shape: (3,)
a
i64
null
1
1
s.pct_change()
shape: (3,)
a
f64
null
1.0
0.5

Rolling Operations

Aka Sliding Window operations.

s.rolling_mean(window_size=2)
shape: (3,)
a
f64
null
1.5
2.5
s.rolling_sum(window_size=2)
shape: (3,)
a
i64
null
3
5
s.rolling_map(
  sum, 
  window_size=2)
shape: (3,)
a
i64
null
3
5

Note:

  • sum is the pl.sum() function. You cannot use arbitrary functions within a rolling_map() call.
  • Many rolling functions have been prepared. See the computations section of the Series class in the official API.

Uniques and Duplicates

s.is_duplicated()
shape: (3,)
a
bool
false
false
false
s.is_unique()
shape: (3,)
a
bool
true
true
true
s.unique() # Same as Pandas drop_duplicates()
shape: (3,)
a
i64
1
2
3
s.n_unique()
3
pl.Series([1,2,3,4,1]).unique_counts()
shape: (4,)
u32
2
1
1
1
s.is_first_distinct() # not sure there is a pl.Expr method 
shape: (3,)
a
bool
true
true
true

Notes:

  • is_first_distinct() has had several name changes: it was is_first() in Polars 0.18.0, and became is_first_distinct() in Polars 0.19.0.

  • Do not confuse .is_first_distinct() with .first(). The former is a logical aggregation, and the latter is a series method.

(
  pl.DataFrame(pl.Series("a",[1,2,3,1]))
  .select(
    pl.col('a').first()
    )
)
shape: (1, 1)
a
i64
1

Casting

s.cast(pl.Int32)
shape: (3,)
a
i32
1
2
3

Things to note:

  • cast() is Polars’ equivalent of Pandas’ astype().
  • The dtypes to cast to are Polars dtypes. Don’t try s.cast("int32"), s.cast(np.int32), or s.cast(pd.int)
  • For a list of dtypes see the official documentation.

Find the most efficient dtype for a series; like Pandas' pd.to_numeric(..., downcast="...").

s.shrink_dtype().dtype
Int8

Also see here.

Ordering and Sorting

s.sort()
shape: (3,)
a
i64
1
2
3
s.reverse()
shape: (3,)
a
i64
3
2
1
s.rank()
shape: (3,)
a
f64
1.0
2.0
3.0
s.arg_sort()  # same as R's order()
shape: (3,)
a
u32
0
1
2

arg_sort() returns the indices that would sort the series. Same as R’s order().

sorted_s = s[s.arg_sort()]
(s.sort() == sorted_s).all()
True

Sampling

Sample without replacement.

s.sample(2, with_replacement=False)
shape: (2,)
a
i64
1
3

A random permutation (shuffle).

s.shuffle(seed=1) # random permutation
shape: (3,)
a
i64
2
1
3

Date and Time

There are 4 datetime dtypes in Polars:

  1. Date: A date, without hours. Generated with pl.Date().
  2. Datetime: Date and hours. Generated with pl.Datetime().
  3. Time: Hour of day. Generated with pl.Time().
  4. Duration: As the name suggests. Similar to timedelta in Pandas. Generated with pl.Duration().
Warning

Python has a sea of modules that support datetimes. Polars can perfectly coexist with datetime. I am not so sure about other datetime modules such as dateutil, numpy, Pandas, arrow, scikits.timeseries (now deprecated)…

Time Range

from datetime import datetime, timedelta

date = (
  pl.datetime_range(
    start = datetime(
      year= 2001, month=2, day=2, hour =2, minute=24, second=12
      ), 
    end = datetime(
      year=2002, month=2, day=5, hour =5, minute=34, second=45
      ),
    interval='1s', 
    eager= True, 
  )
  .sample(5)
)

date
shape: (5,)
literal
datetime[μs]
2001-08-21 22:12:47
2001-06-15 10:47:13
2001-11-10 22:58:30
2001-12-10 09:43:07
2001-07-16 08:17:44

Things to note:

  • pl.date_range may return a series of dtype Date or Datetime. This depends on the granularity of the inputs.
  • Try other datetime dtypes as start and end.
date.dtype
Datetime(time_unit='us', time_zone=None)

Cast to different time unit. May be useful when joining datasets, and the time unit is different.

date.dt.cast_time_unit(time_unit="ms")
shape: (5,)
literal
datetime[ms]
2001-08-21 22:12:47
2001-06-15 10:47:13
2001-11-10 22:58:30
2001-12-10 09:43:07
2001-07-16 08:17:44

Datetime methods are accessed with the .dt namespace.

Extract Time Sub-Units

date.dt.second()
shape: (5,)
literal
i8
47
13
30
7
44
date.dt.minute()
shape: (5,)
literal
i8
12
47
58
43
17
date.dt.hour()
shape: (5,)
literal
i8
22
10
22
9
8
date.dt.day()
shape: (5,)
literal
i8
21
15
10
10
16
date.dt.week()
shape: (5,)
literal
i8
34
24
45
50
29
date.dt.weekday()
shape: (5,)
literal
i8
2
5
6
1
1
date.dt.month()
shape: (5,)
literal
i8
8
6
11
12
7
date.dt.year()
shape: (5,)
literal
i32
2001
2001
2001
2001
2001
date.dt.ordinal_day() # day in year
shape: (5,)
literal
i16
233
166
314
344
197
date.dt.quarter()
shape: (5,)
literal
i8
3
2
4
4
3

Round and Truncate Time

What if I don’t want the month-in-year, but rather the year-month? Enter .dt.truncate(), and the more experimental .dt.round().

pl.DataFrame(
  {'raw': date,
  'truncated': date.dt.truncate("1mo"),
  'rounded': date.dt.round("1mo"),
  }
)
shape: (5, 3)
raw truncated rounded
datetime[μs] datetime[μs] datetime[μs]
2001-08-21 22:12:47 2001-08-01 00:00:00 2001-09-01 00:00:00
2001-06-15 10:47:13 2001-06-01 00:00:00 2001-06-01 00:00:00
2001-11-10 22:58:30 2001-11-01 00:00:00 2001-11-01 00:00:00
2001-12-10 09:43:07 2001-12-01 00:00:00 2001-12-01 00:00:00
2001-07-16 08:17:44 2001-07-01 00:00:00 2001-07-01 00:00:00

Things to note:

  • You can think of .dt.truncate() as a floor operation on dates, and .dt.round() as a round operation.
  • Unlike strftime()-type methods, the output of .dt.round() and .dt.truncate() is a Datetime dtype; not a string.

Durations

Equivalent to Pandas' timedelta dtype.

diffs = date.diff()
diffs
shape: (5,)
literal
duration[μs]
null
-67d -11h -25m -34s
148d 12h 11m 17s
29d 10h 44m 37s
-147d -1h -25m -23s
diffs.dtype
Duration(time_unit='us')
Important

The extractor of sub-units from a pl.Duration has recently changed from .dt.X() to .dt.total_X().

diffs.dt.total_seconds()
shape: (5,)
literal
i64
null
-5829934
12831077
2544277
-12705923
diffs.dt.total_minutes()
shape: (5,)
literal
i64
null
-97165
213851
42404
-211765
diffs.dt.total_days()
shape: (5,)
literal
i64
null
-67
148
29
-147
diffs.dt.total_hours()
shape: (5,)
literal
i64
null
-1619
3564
706
-3529

Date Aggregations

Note that aggregating dates returns a datetime-type object.

date.dt.max()
datetime.datetime(2001, 12, 10, 9, 43, 7)
date.dt.min()
datetime.datetime(2001, 6, 15, 10, 47, 13)

I never quite understood what the “average day” is.

date.dt.mean()
datetime.datetime(2001, 9, 8, 9, 59, 52, 200000)
date.dt.median()
datetime.datetime(2001, 8, 21, 22, 12, 47)

Adding Constant Periods

Adding and subtracting a constant period (“offset”):

date.dt.offset_by(by="-100y")
shape: (5,)
literal
datetime[μs]
1901-08-21 22:12:47
1901-06-15 10:47:13
1901-11-10 22:58:30
1901-12-10 09:43:07
1901-07-16 08:17:44
date.dt.offset_by(by="1y2m20d")
shape: (5,)
literal
datetime[μs]
2002-09-10 22:14:47
2002-07-05 10:49:13
2002-11-30 23:00:30
2002-12-30 09:45:07
2002-08-05 08:19:44

From Date to String

date.dt.to_string("%Y-%m-%d")
shape: (5,)
literal
str
"2001-08-21"
"2001-06-15"
"2001-11-10"
"2001-12-10"
"2001-07-16"

Or equivalently:

date.dt.strftime("%Y-%m-%d")
shape: (5,)
literal
str
"2001-08-21"
"2001-06-15"
"2001-11-10"
"2001-12-10"
"2001-07-16"

From String to Datetime

sd = pl.Series(
    "date",
    [
        "2021-04-22",
        "2022-01-04 00:00:00",
        "01/31/22",
        "Sun Jul  8 00:34:60 2001",
    ],
)
sd
shape: (4,)
date
str
"2021-04-22"
"2022-01-04 00:00:00"
"01/31/22"
"Sun Jul 8 00:34:60 2001"

Parse into Date type.

sd.str.to_date(format="%F", strict=False)
shape: (4,)
date
date
2021-04-22
null
null
null

Or equivalently:

sd.str.strptime(dtype= pl.Date, format="%F", strict=False)
shape: (4,)
date
date
2021-04-22
null
null
null
sd.str.strptime(pl.Date, "%D", strict=False)
shape: (4,)
date
date
null
null
2022-01-31
null

Parse into Datetime type.

sd.str.to_datetime(format="%F %T", strict=False)
shape: (4,)
date
datetime[μs]
null
2022-01-04 00:00:00
null
null

Or equivalently:

sd.str.strptime(pl.Datetime, "%F %T", strict=False)
shape: (4,)
date
datetime[μs]
null
2022-01-04 00:00:00
null
null
sd.str.strptime(pl.Datetime, "%a %h %d %T %Y",strict=False)
shape: (4,)
date
datetime[μs]
null
null
null
2001-07-08 00:35:00

Parse into Time dtype.

sd.str.to_time("%a %h %d %T %Y",strict=False)
shape: (4,)
date
time
null
null
null
00:35:00
sd.str.strptime(pl.Time, "%a %h %d %T %Y", strict=False)
shape: (4,)
date
time
null
null
null
00:35:00

Strings

String methods are accessed with the .str namespace.

st = pl.Series("a", ["foo", "bar", "baz"])
st.str.len_chars() # number of chars. For ASCII this is the same as len_bytes()
shape: (3,)
a
u32
3
3
3
st.str.concat("-")
shape: (1,)
a
str
"foo-bar-baz"
st.str.count_matches(pattern= 'o') # count literal matches
shape: (3,)
a
u32
2
0
0
st.str.contains("foo|tra|bar") 
shape: (3,)
a
bool
true
true
false
st.str.contains("ba[a-zA-Z]") 
shape: (3,)
a
bool
false
true
true
st.str.contains("[a-zA-Z]{4,5}") 
shape: (3,)
a
bool
false
false
false
st.str.count_matches(pattern='[a-zA-Z]')
shape: (3,)
a
u32
3
3
3
st.str.ends_with("oo")
shape: (3,)
a
bool
true
false
false
st.str.starts_with("fo")
shape: (3,)
a
bool
true
false
false

To extract the first appearance of a pattern, use extract:

url = pl.Series("a", [
            "http://vote.com/ballon_dor?candidate=messi&ref=Polars",

            "http://vote.com/ballon_dor?candidate=jorginho&ref=Polars",

            "http://vote.com/ballon_dor?candidate=ronaldo&ref=Polars"
            ])
url.str.extract("=([a-zA-Z]+)", 1) 
# "=([a-zA-Z]+)" reads: "match an equals sign, then capture one or more letters".
shape: (3,)
a
str
"messi"
"jorginho"
"ronaldo"

To extract all appearances of a pattern, use extract_all:

url.str.extract_all("=(\w+)") # \w is a shorthand for [a-zA-Z0-9_], i.e., alphanumerics and underscore.
shape: (3,)
a
list[str]
["=messi", "=Polars"]
["=jorginho", "=Polars"]
["=ronaldo", "=Polars"]
st.str.pad_end(8, "*")
shape: (3,)
a
str
"foo*****"
"bar*****"
"baz*****"
st.str.pad_start(8, "*")
shape: (3,)
a
str
"*****foo"
"*****bar"
"*****baz"
st.str.strip_chars_start('f')
shape: (3,)
a
str
"oo"
"bar"
"baz"
st.str.strip_chars_end('r')
shape: (3,)
a
str
"foo"
"ba"
"baz"

Replacing first appearance of a pattern:

st.str.replace("o+", "ZZ")
shape: (3,)
a
str
"fZZ"
"bar"
"baz"

Replace all appearances of a pattern:

st.str.replace_all("o", "ZZ")
shape: (3,)
a
str
"fZZZZ"
"bar"
"baz"

String to list of strings. Number of splits inferred.

st.str.split(by="o")
shape: (3,)
a
list[str]
["f", "", ""]
["bar"]
["baz"]
st.str.split(by="a", inclusive=True)
shape: (3,)
a
list[str]
["foo"]
["ba", "r"]
["ba", "z"]

String to dict of strings (actually a Polars Struct, see Section 5.13). Number of splits fixed.

st.str.split_exact("a", 2)
shape: (3,)
a
struct[3]
{"foo",null,null}
{"b","r",null}
{"b","z",null}

String to dict of strings. Length of output fixed.

st.str.splitn("a", 4)
shape: (3,)
a
struct[4]
{"foo",null,null,null}
{"b","r",null,null}
{"b","z",null,null}

Strip white spaces.

pl.Series(['   ohh   ','   yeah   ']).str.strip_chars()
shape: (2,)
str
"ohh"
"yeah"
st.str.to_uppercase()
shape: (3,)
a
str
"FOO"
"BAR"
"BAZ"
st.str.to_lowercase()
shape: (3,)
a
str
"foo"
"bar"
"baz"
st.str.to_titlecase()
shape: (3,)
a
str
"Foo"
"Bar"
"Baz"
st.str.zfill(5)
shape: (3,)
a
str
"00foo"
"00bar"
"00baz"
st.str.slice(offset=1, length=1)
shape: (3,)
a
str
"o"
"a"
"a"

Done with pl.Series!

Polars (Eager) DataFrames

Recall :

  1. Eager frames are in-memory. They will thus feel like Pandas frames, accessed with a PySpark’ish syntax.
  2. There is no row index (like R’s data.frame, data.table, and tibble; unlike Python’s Pandas).

A full list of DataFrame methods can be found here.

Create

A frame can be created as you would expect. From a dictionary of series, a numpy array, a Pandas dataframe, or a list of Polars (or Pandas) series, etc. Here, we create from a dict of Python lists.

df = pl.DataFrame({
  "integer": [1, 2, 3], 
  "date": [
    (datetime(2022, 1, 1)), 
    (datetime(2022, 1, 2)), 
    (datetime(2022, 1, 3))], 
    "float": [4.0, 5.0, 6.0],
    "string": ["a", "b", "c"]})

Inspect

Nice HTML printing to iPython.

df
shape: (3, 4)
integer date float string
i64 datetime[μs] f64 str
1 2022-01-01 00:00:00 4.0 "a"
2 2022-01-02 00:00:00 5.0 "b"
3 2022-01-03 00:00:00 6.0 "c"
print(df)
shape: (3, 4)
┌─────────┬─────────────────────┬───────┬────────┐
│ integer ┆ date                ┆ float ┆ string │
│ ---     ┆ ---                 ┆ ---   ┆ ---    │
│ i64     ┆ datetime[μs]        ┆ f64   ┆ str    │
╞═════════╪═════════════════════╪═══════╪════════╡
│ 1       ┆ 2022-01-01 00:00:00 ┆ 4.0   ┆ a      │
│ 2       ┆ 2022-01-02 00:00:00 ┆ 5.0   ┆ b      │
│ 3       ┆ 2022-01-03 00:00:00 ┆ 6.0   ┆ c      │
└─────────┴─────────────────────┴───────┴────────┘

I confess I like to look at the frame this way

df.to_pandas()
integer date float string
0 1 2022-01-01 4.0 a
1 2 2022-01-02 5.0 b
2 3 2022-01-03 6.0 c
df.glimpse() # useful for wide frames. Similar to R's str() or Pandas' .info()
Rows: 3
Columns: 4
$ integer          <i64> 1, 2, 3
$ date    <datetime[μs]> 2022-01-01 00:00:00, 2022-01-02 00:00:00, 2022-01-03 00:00:00
$ float            <f64> 4.0, 5.0, 6.0
$ string           <str> 'a', 'b', 'c'
df.columns
['integer', 'date', 'float', 'string']
df.shape
(3, 4)
df.height # probably more useful than df.shape[0]
3
df.width
4
df.schema # similar to Pandas info()
OrderedDict([('integer', Int64),
             ('date', Datetime(time_unit='us', time_zone=None)),
             ('float', Float64),
             ('string', String)])
df.with_row_index()
shape: (3, 5)
index integer date float string
u32 i64 datetime[μs] f64 str
0 1 2022-01-01 00:00:00 4.0 "a"
1 2 2022-01-02 00:00:00 5.0 "b"
2 3 2022-01-03 00:00:00 6.0 "c"

Things to note:

  1. df.schema and df.columns will be available for lazy objects, even before materializing them.
  2. df.height and df.shape will not be available for lazy objects, until they are materialized.

Intro to Column Operations

Important

This is probably the most important section of the document.

Contexts

As discussed in Section 3.4.1, operations on columns will always be done within a context.

  • df.select() to select and transform columns.
  • df.with_columns() to return all columns after transformations.
  • df.group_by().agg() is actually .select() within a group_by() context.
  • df.filter() is a context for filtering rows.
  • pl.when() is a context for conditional operations.

Select Context:

df.select(pl.col("integer"))
shape: (3, 1)
integer
i64
1
2
3
df.with_columns(pl.col("integer").add(1))
shape: (3, 4)
integer date float string
i64 datetime[μs] f64 str
2 2022-01-01 00:00:00 4.0 "a"
3 2022-01-02 00:00:00 5.0 "b"
4 2022-01-03 00:00:00 6.0 "c"

Group-by context:

df.group_by("string").agg(pl.col("integer").mean())
shape: (3, 2)
string integer
str f64
"c" 3.0
"a" 1.0
"b" 2.0

Extracting Columns as Expressions

Within a context there are many ways to access a column. Here are some examples of various ways of adding 1 to the integer column.

df.select(pl.col("integer").add(1))
shape: (3, 1)
integer
i64
2
3
4
df.select(pl.col(["integer"]).add(1))
shape: (3, 1)
integer
i64
2
3
4
df.select(pl.col(r"^integer$").add(1))
shape: (3, 1)
integer
i64
2
3
4

Things to note:

  • pl.col() is the most general way to access a column. It will work in all contexts.
  • pl.col() can accept a string, a list of strings, a regex, etc.
  • You don’t really need the “r” prefix in r"^integer$". I’ve put it to emphasize that it is a regex. What you do need is to start the string with ^ and end it with $ so that Polars knows this is a RegEx.
  • pl.col() with RegEx is a super power! Help your future self by naming columns so you can easily reference them with RegEx.
df.select(pl.col(pl.Int64).add(1))
shape: (3, 1)
integer
i64
2
3
4
df.select(pl.first().add(1))
shape: (3, 1)
integer
i64
2
3
4
import polars.selectors as cs

df.select(cs.by_name("integer").add(1))
shape: (3, 1)
integer
i64
2
3
4
df.select(cs.ends_with("ger").add(1))
shape: (3, 1)
integer
i64
2
3
4
df.select(cs.starts_with('int').add(1))
shape: (3, 1)
integer
i64
2
3
4
df.select(cs.integer().add(1))
shape: (3, 1)
integer
i64
2
3
4
df.select(cs.first().add(1))
shape: (3, 1)
integer
i64
2
3
4
df.select(cs.matches(r"^integer$").add(1))
shape: (3, 1)
integer
i64
2
3
4
df.select(cs.contains("int").add(1))
shape: (3, 1)
integer
i64
2
3
4
df.select(pl.col("integer")+1)
shape: (3, 1)
integer
i64
2
3
4

Extracting Columns as Series or Frames

df['integer']+1
shape: (3,)
integer
i64
2
3
4
df.get_column('integer')+1
shape: (3,)
integer
i64
2
3
4

The following will not work because a series does not have a .add(1) method.

df['integer'].add(1)
df.get_column('integer').add(1)

Convert to Other Python Objects

You can always convert your pl.Series or pl.DataFrame to other Python objects.

To Pandas DataFrame

df.to_pandas()
integer date float string
0 1 2022-01-01 4.0 a
1 2 2022-01-02 5.0 b
2 3 2022-01-03 6.0 c

To Numpy Array

df.to_numpy()
array([[1, datetime.datetime(2022, 1, 1, 0, 0), 4.0, 'a'],
       [2, datetime.datetime(2022, 1, 2, 0, 0), 5.0, 'b'],
       [3, datetime.datetime(2022, 1, 3, 0, 0), 6.0, 'c']], dtype=object)

To List of Polars Series

df.get_columns() # columns as list of Polars series
[shape: (3,)
 Series: 'integer' [i64]
 [
    1
    2
    3
 ],
 shape: (3,)
 Series: 'date' [datetime[μs]]
 [
    2022-01-01 00:00:00
    2022-01-02 00:00:00
    2022-01-03 00:00:00
 ],
 shape: (3,)
 Series: 'float' [f64]
 [
    4.0
    5.0
    6.0
 ],
 shape: (3,)
 Series: 'string' [str]
 [
    "a"
    "b"
    "c"
 ]]

To list of tuples

df.rows() 
[(1, datetime.datetime(2022, 1, 1, 0, 0), 4.0, 'a'),
 (2, datetime.datetime(2022, 1, 2, 0, 0), 5.0, 'b'),
 (3, datetime.datetime(2022, 1, 3, 0, 0), 6.0, 'c')]

To Dict of Polars Series

df.to_dict() # columns as dict of Polars series
{'integer': shape: (3,)
 Series: 'integer' [i64]
 [
    1
    2
    3
 ],
 'date': shape: (3,)
 Series: 'date' [datetime[μs]]
 [
    2022-01-01 00:00:00
    2022-01-02 00:00:00
    2022-01-03 00:00:00
 ],
 'float': shape: (3,)
 Series: 'float' [f64]
 [
    4.0
    5.0
    6.0
 ],
 'string': shape: (3,)
 Series: 'string' [str]
 [
    "a"
    "b"
    "c"
 ]}

To Dict of Python Lists

df.to_dict(as_series=False) # columns as dict of Polars series
{'integer': [1, 2, 3],
 'date': [datetime.datetime(2022, 1, 1, 0, 0),
  datetime.datetime(2022, 1, 2, 0, 0),
  datetime.datetime(2022, 1, 3, 0, 0)],
 'float': [4.0, 5.0, 6.0],
 'string': ['a', 'b', 'c']}

To String Representation (repr)

df.to_init_repr()
'pl.DataFrame(\n    [\n        pl.Series("integer", [1, 2, 3], dtype=pl.Int64),\n        pl.Series("date", [datetime.datetime(2022, 1, 1, 0, 0), datetime.datetime(2022, 1, 2, 0, 0), datetime.datetime(2022, 1, 3, 0, 0)], dtype=pl.Datetime(time_unit=\'us\', time_zone=None)),\n        pl.Series("float", [4.0, 5.0, 6.0], dtype=pl.Float64),\n        pl.Series("string", [\'a\', \'b\', \'c\'], dtype=pl.String),\n    ]\n)\n'

To a Polars Series of Polars Struct

df.to_struct()
shape: (3,)
struct[4]
{1,2022-01-01 00:00:00,4.0,"a"}
{2,2022-01-02 00:00:00,5.0,"b"}
{3,2022-01-03 00:00:00,6.0,"c"}

To PyArrow Table

df.to_arrow()
pyarrow.Table
integer: int64
date: timestamp[us]
float: double
string: large_string
----
integer: [[1,2,3]]
date: [[2022-01-01 00:00:00.000000,2022-01-02 00:00:00.000000,2022-01-03 00:00:00.000000]]
float: [[4,5,6]]
string: [["a","b","c"]]

Statistical Aggregations

df.describe()
shape: (9, 5)
statistic integer date float string
str f64 str f64 str
"count" 3.0 "3" 3.0 "3"
"null_count" 0.0 "0" 0.0 "0"
"mean" 2.0 "2022-01-02 00:00:00" 5.0 null
"std" 1.0 null 1.0 null
"min" 1.0 "2022-01-01 00:00:00" 4.0 "a"
"25%" 2.0 "2022-01-02 00:00:00" 5.0 null
"50%" 2.0 "2022-01-02 00:00:00" 5.0 null
"75%" 3.0 "2022-01-03 00:00:00" 6.0 null
"max" 3.0 "2022-01-03 00:00:00" 6.0 "c"

Compare to Pandas: Polars will summarize all columns even if they are not numeric.

df.to_pandas().describe()
integer date float
count 3.0 3 3.0
mean 2.0 2022-01-02 00:00:00 5.0
min 1.0 2022-01-01 00:00:00 4.0
25% 1.5 2022-01-01 12:00:00 4.5
50% 2.0 2022-01-02 00:00:00 5.0
75% 2.5 2022-01-02 12:00:00 5.5
max 3.0 2022-01-03 00:00:00 6.0
std 1.0 NaN 1.0

Statistical aggregations operate column-wise (and in parallel!).

df.max()
shape: (1, 4)
integer date float string
i64 datetime[μs] f64 str
3 2022-01-03 00:00:00 6.0 "c"
df.min()
shape: (1, 4)
integer date float string
i64 datetime[μs] f64 str
1 2022-01-01 00:00:00 4.0 "a"
df.mean()
shape: (1, 4)
integer date float string
f64 datetime[μs] f64 str
2.0 2022-01-02 00:00:00 5.0 null
df.median()
shape: (1, 4)
integer date float string
f64 datetime[μs] f64 str
2.0 2022-01-02 00:00:00 5.0 null
df.quantile(0.1)
shape: (1, 4)
integer date float string
f64 datetime[μs] f64 str
1.0 null 4.0 null
df.sum()
shape: (1, 4)
integer date float string
i64 datetime[μs] f64 str
6 null 15.0 null

Contrast with row-wise summation:

df.with_columns(pl.sum_horizontal('*'))
shape: (3, 4)
integer date float string
str datetime[μs] f64 str
"16409952000000014.0a" 2022-01-01 00:00:00 4.0 "a"
"16410816000000025.0b" 2022-01-02 00:00:00 5.0 "b"
"16411680000000036.0c" 2022-01-03 00:00:00 6.0 "c"

Selections

  1. If you are used to Pandas, recall there is no index. There is thus no need for loc vs. iloc, reset_index(), etc. See here for a comparison of extractors between Polars and Pandas.
  2. Filtering and selection is possible with the [ operator, or the filter() and select() methods. The latter is recommended to facilitate query planning (discussed in Section 2.3).

Selecting With Indices

The following are presented for completeness. Generally, you can, and should, avoid selecting with indices. See Section 5.6.2 for selecting columns, and Section 5.6.5 for selecting rows.

df[0,0] # like Pandas .iloc[]
1

Slicing along rows.

df[0:1] 
shape: (1, 4)
integer date float string
i64 datetime[μs] f64 str
1 2022-01-01 00:00:00 4.0 "a"

Slicing along columns.

df[:,0:1]
shape: (3, 1)
integer
i64
1
2
3

Selecting Columns

First: do you want to return a Polars frame or a Polars series?

For a frame:

df.select("integer")
shape: (3, 1)
integer
i64
1
2
3

For a series:

df['integer']
shape: (3,)
integer
i64
1
2
3

How do I know which is which?

  1. You can use type().
  2. Notice the dimension of the index in the output.

Select columns with list of labels

df.select(["integer", "float"])
shape: (3, 2)
integer float
i64 f64
1 4.0
2 5.0
3 6.0

As of Polars >= 0.15.0, you don’t have to pass a list:

df.select("integer", "float")
shape: (3, 2)
integer float
i64 f64
1 4.0
2 5.0
3 6.0

Column slicing by label

df[:,"integer":"float"]
shape: (3, 3)
integer date float
i64 datetime[μs] f64
1 2022-01-01 00:00:00 4.0
2 2022-01-02 00:00:00 5.0
3 2022-01-03 00:00:00 6.0

Note: df.select() does not support slicing ranges such as df.select("integer":"float").

Get a column as a Polars Series.

df.get_column('integer')
shape: (3,)
integer
i64
1
2
3

Get a column by its position, as a Polars series.

df.to_series(0)
shape: (3,)
integer
i64
1
2
3
df.get_column_index('float')
2
df.drop("integer")
shape: (3, 3)
date float string
datetime[μs] f64 str
2022-01-01 00:00:00 4.0 "a"
2022-01-02 00:00:00 5.0 "b"
2022-01-03 00:00:00 6.0 "c"

df.drop() does not have an inplace argument. Use df.drop_in_place() instead.

pl.col()

The pl.col() is super important. It allows you to select columns in many ways, and provides almost all the methods (i.e. Polars Expressions) you will need to operate on them.

df.select(pl.col(pl.Int64))
shape: (3, 1)
integer
i64
1
2
3
df.select(pl.col(pl.Float64))
shape: (3, 1)
float
f64
4.0
5.0
6.0
df.select(pl.col(pl.Utf8))
shape: (3, 1)
string
str
"a"
"b"
"c"

Python List of Polars dtypes

df.select(pl.col([pl.Int64, pl.Float64]))
shape: (3, 2)
integer float
i64 f64
1 4.0
2 5.0
3 6.0

Patterns (“glob”)

df.select(pl.col("*")) # same as df.select(pl.all())
shape: (3, 4)
integer date float string
i64 datetime[μs] f64 str
1 2022-01-01 00:00:00 4.0 "a"
2 2022-01-02 00:00:00 5.0 "b"
3 2022-01-03 00:00:00 6.0 "c"

Regular Expressions. Important! Use pl.col('^<pattern>$') to match regular expressions.

df.select(pl.col(r"^\w{4}$")) # a raw string avoids escaping issues
shape: (3, 1)
date
datetime[μs]
2022-01-01 00:00:00
2022-01-02 00:00:00
2022-01-03 00:00:00
df.select(pl.col('^.*g$'))
shape: (3, 1)
string
str
"a"
"b"
"c"
df.select(pl.col("^.*te.*$")) # regex matching anything with a "te"
shape: (3, 2)
integer date
i64 datetime[μs]
1 2022-01-01 00:00:00
2 2022-01-02 00:00:00
3 2022-01-03 00:00:00

You can use pl.col() to exclude columns.

df.select(pl.col("*").exclude("integer"))
shape: (3, 3)
date float string
datetime[μs] f64 str
2022-01-01 00:00:00 4.0 "a"
2022-01-02 00:00:00 5.0 "b"
2022-01-03 00:00:00 6.0 "c"
df.select(pl.col("*").exclude(pl.Float64))
shape: (3, 3)
integer date string
i64 datetime[μs] str
1 2022-01-01 00:00:00 "a"
2 2022-01-02 00:00:00 "b"
3 2022-01-03 00:00:00 "c"

Exciting! There is a new API for column selection: selectors.

import polars.selectors as cs

df.select(cs.starts_with('i'))
shape: (3, 1)
integer
i64
1
2
3
df.select(cs.starts_with('i') | cs.starts_with('d'))
shape: (3, 2)
integer date
i64 datetime[μs]
1 2022-01-01 00:00:00
2 2022-01-02 00:00:00
3 2022-01-03 00:00:00

Selecting Rows By Index

df.limit(2)
shape: (2, 4)
integer date float string
i64 datetime[μs] f64 str
1 2022-01-01 00:00:00 4.0 "a"
2 2022-01-02 00:00:00 5.0 "b"
df.head(2)
shape: (2, 4)
integer date float string
i64 datetime[μs] f64 str
1 2022-01-01 00:00:00 4.0 "a"
2 2022-01-02 00:00:00 5.0 "b"
df.tail(1)
shape: (1, 4)
integer date float string
i64 datetime[μs] f64 str
3 2022-01-03 00:00:00 6.0 "c"
df.gather_every(2)
shape: (2, 4)
integer date float string
i64 datetime[μs] f64 str
1 2022-01-01 00:00:00 4.0 "a"
3 2022-01-03 00:00:00 6.0 "c"
df.slice(offset=1, length=1)
shape: (1, 4)
integer date float string
i64 datetime[μs] f64 str
2 2022-01-02 00:00:00 5.0 "b"
df.sample(1)
shape: (1, 4)
integer date float string
i64 datetime[μs] f64 str
1 2022-01-01 00:00:00 4.0 "a"

Because .sample() requires the row count, it will not work on lazy frames.

Get row as tuple.

df.row(1)
(2, datetime.datetime(2022, 1, 2, 0, 0), 5.0, 'b')

Get row as dict

df.row(1, named=True)
{'integer': 2,
 'date': datetime.datetime(2022, 1, 2, 0, 0),
 'float': 5.0,
 'string': 'b'}

Selecting Rows By Condition

A.k.a. selection, in relational-algebra terms (projection, in contrast, selects columns).

Enter the df.filter() context.

df.filter(pl.col("integer").eq(2))
shape: (1, 4)
integer date float string
i64 datetime[μs] f64 str
2 2022-01-02 00:00:00 5.0 "b"

Things to note:

  • df.filter() is a Polars Context.
  • It is a keep filter, not a drop filter: it will evaluate expressions, and keep only the rows where the expression evaluates to True.
  • The [ operator does not support boolean masks, such as df[df["integer"] == 2].
  • The filter() method is recommended over [ by the authors of Polars, to facilitate lazy evaluation (discussed later).

An alternative syntax for equality filtering, known as constraint in the Polars documentation.

df.filter(integer = 2)
shape: (1, 4)
integer date float string
i64 datetime[μs] f64 str
2 2022-01-02 00:00:00 5.0 "b"

AND conditions:

df.filter(
  pl.col('integer').eq(2),
  pl.col('float').gt(10)
)
shape: (0, 4)
integer date float string
i64 datetime[μs] f64 str
df.filter(
  pl.col('integer').eq(2) &
  pl.col('float').gt(10)
)
shape: (0, 4)
integer date float string
i64 datetime[μs] f64 str

OR conditions:

df.filter(
  pl.col('integer').eq(2) |
  pl.col('float').gt(10)
)
shape: (1, 4)
integer date float string
i64 datetime[μs] f64 str
2 2022-01-02 00:00:00 5.0 "b"
Note

How would you write an AND, or OR condition, without using the comparison methods .eq(), .gt(), etc.?

Selecting From Single Item Frame

Say your operation returned a Polars frame with a single float, which you want to manipulate as a Python float:

pl.DataFrame([1]).item()
1

Column Transformations

  1. Transformations are done with Polars Expressions (Section 7) within a context (see Section 5.3.1).
  • The output column will have the same name as the input, unless you use the alias() method to rename it.
df.with_columns(
    pl.col("integer").mul(2),
    pl.col("integer").alias("integer2"),
    integer3 = pl.col("integer").truediv(3),
)
shape: (3, 6)
integer date float string integer2 integer3
i64 datetime[μs] f64 str i64 f64
2 2022-01-01 00:00:00 4.0 "a" 1 0.333333
4 2022-01-02 00:00:00 5.0 "b" 2 0.666667
6 2022-01-03 00:00:00 6.0 "c" 3 1.0

Warning!

The following raises a SyntaxError, because Python forbids positional arguments after keyword arguments:

df.with_columns(
    integer3 = pl.col("integer").truediv(3),
    pl.col("integer").mul(2),
    pl.col("integer").alias("integer2"),
)

Things to note:

  • You cannot use [ to assign! This would not have worked: df['integer3'] = df['integer'] * 2.
  • The column integer is multiplied by 2 in place, because no alias is used.
  • As of a recent Polars version (I think), you can use = to assign. That is how integer3 is created.
  • The column integer is copied, by renaming it to integer2.
  • Why .truediv()? To distinguish from .floordiv() and .mod(). See this issue for a discussion of the topic.

If a selection returns multiple columns, all will be transformed:

df.with_columns(
    pl.col([pl.Int64,pl.Float64]).mul(2)
)
shape: (3, 4)
integer date float string
i64 datetime[μs] f64 str
2 2022-01-01 00:00:00 8.0 "a"
4 2022-01-02 00:00:00 10.0 "b"
6 2022-01-03 00:00:00 12.0 "c"
df.with_columns(
    pl.all().cast(pl.Utf8)
)
shape: (3, 4)
integer date float string
str str str str
"1" "2022-01-01 00:00:00.000000" "4.0" "a"
"2" "2022-01-02 00:00:00.000000" "5.0" "b"
"3" "2022-01-03 00:00:00.000000" "6.0" "c"

You cannot .alias() when operating on multiple columns, but you can use .name.suffix() or .name.prefix() from the .name namespace.

df.with_columns(
    pl.col([pl.Int64,pl.Float64]).mul(2).name.suffix("_2X")
)
shape: (3, 6)
integer date float string integer_2X float_2X
i64 datetime[μs] f64 str i64 f64
1 2022-01-01 00:00:00 4.0 "a" 2 8.0
2 2022-01-02 00:00:00 5.0 "b" 4 10.0
3 2022-01-03 00:00:00 6.0 "c" 6 12.0

Arithmetic

df.select(pl.col('integer').add(1))
shape: (3, 1)
integer
i64
2
3
4
df.select(pl.col('integer').sub(1))
shape: (3, 1)
integer
i64
0
1
2
df.select(pl.col('integer').mul(2))
shape: (3, 1)
integer
i64
2
4
6
df.select(pl.col('integer').truediv(2))
shape: (3, 1)
integer
f64
0.5
1.0
1.5

Transform the Transformed Columns

All the expressions within a context see the frame in its initial state. Recall, odds are the expressions will be evaluated in parallel, not sequentially. So how can you operate on columns you have just transformed? By chaining contexts!

(
  df
  .with_columns(
    pl.col("integer").truediv(pl.col("float")).alias("ratio")
  )
  .with_columns(
    pl.col("ratio").mul(100)
  )
)
shape: (3, 5)
integer date float string ratio
i64 datetime[μs] f64 str f64
1 2022-01-01 00:00:00 4.0 "a" 25.0
2 2022-01-02 00:00:00 5.0 "b" 40.0
3 2022-01-03 00:00:00 6.0 "c" 50.0

Conditional Transformation (if-else)

df.with_columns(
    pl.when(
      pl.col("integer").gt(2)
    )
    .then(pl.lit(1))
    .otherwise(pl.col("integer"))
    .alias("new_col")
)
shape: (3, 5)
integer date float string new_col
i64 datetime[μs] f64 str i64
1 2022-01-01 00:00:00 4.0 "a" 1
2 2022-01-02 00:00:00 5.0 "b" 2
3 2022-01-03 00:00:00 6.0 "c" 1

Things to note:

  • Come to think of it, pl.when().then().otherwise() is a pl.Expr() method; one that is not available for pl.Series.
  • The otherwise() method is optional. If omitted, the original column will be returned (see next example).
  • pl.lit(1) is a Polars expression that returns the literal 1. It may be omitted, but it is good practice to include it for clarity and safety.
  • pl.col("integer").gt(2) could have been replaced with pl.col("integer") > 2. I like the former because it allows easier composition of conditions.
df.with_columns(
    pl.when(
      pl.col("integer") > 2
    )
    .then(1)
    .otherwise(pl.col("integer"))
    .alias("new_col")
)
shape: (3, 5)
integer date float string new_col
i64 datetime[μs] f64 str i64
1 2022-01-01 00:00:00 4.0 "a" 1
2 2022-01-02 00:00:00 5.0 "b" 2
3 2022-01-03 00:00:00 6.0 "c" 1

Python Lambda Functions

Apply your own lambda function.

(
  df
  .select([pl.col("integer"), pl.col("float")])
  .map_rows(lambda x: x[0] + x[1])
)
shape: (3, 1)
map
f64
5.0
7.0
9.0

As usual, applying your own Python functions may take a serious toll on performance.

df_big = pl.DataFrame(np.random.randn(1000000, 2), schema=["a", "b"]) # previous versions used columns= instead of schema=
%timeit -n2 -r2 df_big.sum_horizontal()
1.48 ms ± 89.4 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
%timeit -n2 -r2 df_big.map_rows(lambda x: x[0] + x[1])
229 ms ± 603 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)

Numpy Ufuncs

You can use Numpy’s universal functions (ufuncs) on Polars frames. There is little overhead in using Numpy ufuncs.

  • See here to use Numpy ufuncs in Polars.
  • See here to create your own Numpy ufunc’s.

Applying off-the-shelf Numpy ufuncs is as simple as:

df.select(pl.col('integer').pipe(np.sin))
# the same as 
# df.select(np.sin(pl.col('integer')))
shape: (3, 1)
integer
f64
0.841471
0.909297
0.14112

Writing your own Numpy ufunc is easy.

def myfunc(x):
  return x**2 + 2*x + 1

# make myfunc a ufunc
myfunc_ufunc = np.frompyfunc(myfunc, 1, 1)

df.select(
  pl.col('float').pipe(myfunc_ufunc, casting='unsafe')
  )
shape: (3, 1)
float
f64
25.0
36.0
49.0

Things to note:

  • My Ufunc is created with np.frompyfunc(). It could have also been created with np.vectorize().
  • The casting='unsafe' argument is required, to deal with dtype mismatch. There could be a more elegant way, but I did not find it.
  • Calling a Ufunc with multiple arguments is slightly more involved, and I have not yet found a “clean” solution.

Uniques and Duplicates

Keep uniques; same as pd.drop_duplicates().

df.unique() 
shape: (3, 4)
integer date float string
i64 datetime[μs] f64 str
2 2022-01-02 00:00:00 5.0 "b"
1 2022-01-01 00:00:00 4.0 "a"
3 2022-01-03 00:00:00 6.0 "c"

Can be used with column subset

df.unique(["integer", "float"])
shape: (3, 4)
integer date float string
i64 datetime[μs] f64 str
3 2022-01-03 00:00:00 6.0 "c"
1 2022-01-01 00:00:00 4.0 "a"
2 2022-01-02 00:00:00 5.0 "b"
df.is_unique()
shape: (3,)
bool
true
true
true
df.is_duplicated()
shape: (3,)
bool
false
false
false
df.n_unique()
3

Missing

Make some data with missing values.

df_with_nulls = df.with_columns(
    null_1 = pl.Series("missing", [3, None, np.nan]),
    null_2 = pl.Series("missing", [None, 5, 6]),
)
df_with_nulls.null_count() # same as pd.isnull().sum()
shape: (1, 6)
integer date float string null_1 null_2
u32 u32 u32 u32 u32 u32
0 0 0 0 1 1
df_with_nulls.drop_nulls() # same as pd.dropna()
shape: (1, 6)
integer date float string null_1 null_2
i64 datetime[μs] f64 str f64 i64
3 2022-01-03 00:00:00 6.0 "c" NaN 6

Can I also drop np.nan’s? There is no drop_nan() method. See StackOverflow for workarounds.

df_with_nulls.fill_null(0) # same as pd.fillna(0)
shape: (3, 6)
integer date float string null_1 null_2
i64 datetime[μs] f64 str f64 i64
1 2022-01-01 00:00:00 4.0 "a" 3.0 0
2 2022-01-02 00:00:00 5.0 "b" 0.0 5
3 2022-01-03 00:00:00 6.0 "c" NaN 6

But recall that None and np.nan are not the same thing.

df_with_nulls.fill_nan(99)
shape: (3, 6)
integer date float string null_1 null_2
i64 datetime[μs] f64 str f64 i64
1 2022-01-01 00:00:00 4.0 "a" 3.0 null
2 2022-01-02 00:00:00 5.0 "b" null 5
3 2022-01-03 00:00:00 6.0 "c" 99.0 6
df_with_nulls.interpolate()
shape: (3, 6)
integer date float string null_1 null_2
f64 datetime[μs] f64 str f64 f64
1.0 2022-01-01 00:00:00 4.0 "a" 3.0 null
2.0 2022-01-02 00:00:00 5.0 "b" NaN 5.0
3.0 2022-01-03 00:00:00 6.0 "c" NaN 6.0

Sorting

df.sort(by=["integer","float"])
shape: (3, 4)
integer date float string
i64 datetime[μs] f64 str
1 2022-01-01 00:00:00 4.0 "a"
2 2022-01-02 00:00:00 5.0 "b"
3 2022-01-03 00:00:00 6.0 "c"
df.reverse()
shape: (3, 4)
integer date float string
i64 datetime[μs] f64 str
3 2022-01-03 00:00:00 6.0 "c"
2 2022-01-02 00:00:00 5.0 "b"
1 2022-01-01 00:00:00 4.0 "a"

Groupby

High level:

  • df.group_by() is a context, for grouping. Just like Pandas, only parallelized, etc. The output will have as many rows as the number of groups.
  • df.partition_by() will return a list of frames.
  • pl.select(pl.col().expression().over()) is like Pandas df.groupby.transform(): it will not collapse rows in the original frame. Rather, it will assign to each row in the original frame the aggregate of its group. The output will have the same number of rows as the input.

Aggregations:

The group_by() context will be followed by an aggregation:

  1. df.group_by().agg(): an aggregating context.
  2. df.group_by().map_groups(): to apply your own function to each group. Replaced df.group_by().apply().

You will usually use the .agg() context. Your syntax will usually look like pl.col().some_aggregation().some_chained_expressions(). The aggregations you may use include almost all the pl.Series aggregations in Section 4.4, but also some pl.Expr aggregations, such as pl.first(), pl.last().

Grouping over time:

Think of these as round+group, where you need to state the resolution of temporal rounding.

  • df.rolling() (formerly group_by_rolling()) for rolling window grouping, a.k.a. a sliding window. Each row will be assigned the aggregate of its window.
  • df.group_by_dynamic() for dynamic grouping. Each period will be assigned the aggregate of the period. The output may have more rows than the input.

See the API reference for the various options. Also see the user guide for more details.

df2 = pl.DataFrame({
    "integer": [1, 1, 2, 2, 3, 3],
    "float": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "string": ["a", "b", "c", "c", "d", "d"],
    "datetime": [
        (datetime(2022, 1, 4)), 
        (datetime(2022, 1, 4)), 
        (datetime(2022, 1, 4)), 
        (datetime(2022, 1, 9)), 
        (datetime(2022, 1, 9)), 
        (datetime(2022, 1, 9))],
})

group_by()

groupper = df2.group_by("integer")
groupper.agg(
  cs.numeric().sum().name.suffix("_sum"),
  pl.col('string').n_unique().name.suffix("_n_unique"),
  pl.col('string').last().alias("last_string"),
  pl.col('string'),
)
shape: (3, 5)
integer float_sum string_n_unique last_string string
i64 f64 u32 str list[str]
1 3.0 2 "b" ["a", "b"]
3 11.0 1 "d" ["d", "d"]
2 7.0 1 "c" ["c", "c"]

Things to note:

  • Don’t expect an index. This is Polars, not Pandas.
  • The grouping may be saved as an object, and used later.
  • The group_by() and .agg() contexts offer the usual functionality of .select() and .with_columns(). In particular, you can use pl.col() to access columns, or the polars.selectors module (cs).
  • If the selector is not followed by an expression, it will collapse the Series to a Polars List (see Section 5.13).

Some random/useful examples now follow.

Examples

The count (length) of each group:

groupper.agg(pl.len())
shape: (3, 2)
integer len
i64 u32
1 2
3 2
2 2

When operating on all columns, aggregation shortcuts may be called directly on the grouping object:

groupper.len()
shape: (3, 2)
integer len
i64 u32
2 2
1 2
3 2
groupper.sum()
shape: (3, 4)
integer float string datetime
i64 f64 str datetime[μs]
2 7.0 null null
3 11.0 null null
1 3.0 null null

over()

You may be familiar with Pandas group_by().transform(), which returns a frame with the same row-count as its input. You may be familiar with Postgres SQL window functions. Or you may be familiar with neither, and still want to aggregate within a group, but propagate the result to all group members. Polars’ .over() is the answer.

df2.with_columns(
  pl.col("float").sum().over("string").alias("sum")
)
shape: (6, 5)
integer float string datetime sum
i64 f64 str datetime[μs] f64
1 1.0 "a" 2022-01-04 00:00:00 1.0
1 2.0 "b" 2022-01-04 00:00:00 2.0
2 3.0 "c" 2022-01-04 00:00:00 7.0
2 4.0 "c" 2022-01-09 00:00:00 7.0
3 5.0 "d" 2022-01-09 00:00:00 11.0
3 6.0 "d" 2022-01-09 00:00:00 11.0

Things to note:

  • The output will have the same number of rows as the input.
  • .over() is a context. As such, you can evaluate column selectors and expressions within it.
  • Careful: over() should be the last operation in a chain. The following example will sadly not fail, though it should:
df2.with_columns(
  pl.col("float").over("string").sum().alias("sum")
)
shape: (6, 5)
integer float string datetime sum
i64 f64 str datetime[μs] f64
1 1.0 "a" 2022-01-04 00:00:00 21.0
1 2.0 "b" 2022-01-04 00:00:00 21.0
2 3.0 "c" 2022-01-04 00:00:00 21.0
2 4.0 "c" 2022-01-09 00:00:00 21.0
3 5.0 "d" 2022-01-09 00:00:00 21.0
3 6.0 "d" 2022-01-09 00:00:00 21.0

partition_by()

Make the list of frames

partitioner = df2.partition_by("integer")
partitioner
[shape: (2, 4)
 ┌─────────┬───────┬────────┬─────────────────────┐
 │ integer ┆ float ┆ string ┆ datetime            │
 │ ---     ┆ ---   ┆ ---    ┆ ---                 │
 │ i64     ┆ f64   ┆ str    ┆ datetime[μs]        │
 ╞═════════╪═══════╪════════╪═════════════════════╡
 │ 1       ┆ 1.0   ┆ a      ┆ 2022-01-04 00:00:00 │
 │ 1       ┆ 2.0   ┆ b      ┆ 2022-01-04 00:00:00 │
 └─────────┴───────┴────────┴─────────────────────┘,
 shape: (2, 4)
 ┌─────────┬───────┬────────┬─────────────────────┐
 │ integer ┆ float ┆ string ┆ datetime            │
 │ ---     ┆ ---   ┆ ---    ┆ ---                 │
 │ i64     ┆ f64   ┆ str    ┆ datetime[μs]        │
 ╞═════════╪═══════╪════════╪═════════════════════╡
 │ 2       ┆ 3.0   ┆ c      ┆ 2022-01-04 00:00:00 │
 │ 2       ┆ 4.0   ┆ c      ┆ 2022-01-09 00:00:00 │
 └─────────┴───────┴────────┴─────────────────────┘,
 shape: (2, 4)
 ┌─────────┬───────┬────────┬─────────────────────┐
 │ integer ┆ float ┆ string ┆ datetime            │
 │ ---     ┆ ---   ┆ ---    ┆ ---                 │
 │ i64     ┆ f64   ┆ str    ┆ datetime[μs]        │
 ╞═════════╪═══════╪════════╪═════════════════════╡
 │ 3       ┆ 5.0   ┆ d      ┆ 2022-01-09 00:00:00 │
 │ 3       ┆ 6.0   ┆ d      ┆ 2022-01-09 00:00:00 │
 └─────────┴───────┴────────┴─────────────────────┘]

Then iterate like over any Python list, with a function that operates on Polars frames:

def myfunc(
  df: pl.DataFrame
  ) -> pl.DataFrame:
  return df.select(pl.col("float").sum())

for df in partitioner: 
  print(myfunc(df))
shape: (1, 1)
┌───────┐
│ float │
│ ---   │
│ f64   │
╞═══════╡
│ 3.0   │
└───────┘
shape: (1, 1)
┌───────┐
│ float │
│ ---   │
│ f64   │
╞═══════╡
│ 7.0   │
└───────┘
shape: (1, 1)
┌───────┐
│ float │
│ ---   │
│ f64   │
╞═══════╡
│ 11.0  │
└───────┘

Grouping on time

High level:

  1. Grouping on time is special, because a temporal variable implies multiple resolutions which may be used for grouping. E.g. a date may be grouped by year, month, day, etc.
  2. For a temporal version of group_by(), use df.group_by_dynamic().
  3. For a temporal version of over(), use df.rolling().
(
  df2
  .sort("datetime")
  .group_by_dynamic(
    index_column="datetime", 
    every="1d",
    )
  .agg(pl.col("float").sum())
)
shape: (2, 2)
datetime float
datetime[μs] f64
2022-01-04 00:00:00 6.0
2022-01-09 00:00:00 15.0
(
  df2
  .sort("datetime")
  .rolling(
    index_column="datetime", 
    period='1d',
    )
  .agg(pl.col("float").sum())
)
shape: (6, 2)
datetime float
datetime[μs] f64
2022-01-04 00:00:00 6.0
2022-01-04 00:00:00 6.0
2022-01-04 00:00:00 6.0
2022-01-09 00:00:00 15.0
2022-01-09 00:00:00 15.0
2022-01-09 00:00:00 15.0

Joins

High level:

  • df.join() for joins; like Pandas pd.merge() or df.join().
  • df.join_asof() for asof joins; like Pandas pd.merge_asof().
  • df.hstack() for horizontal concatenation; like Pandas pd.concat([],axis=1) or R’s cbind.
  • df.vstack() for vertical concatenation; like Pandas pd.concat([],axis=0) or R’s rbind.
  • df.merge_sorted() for vertical stacking, with sorting.
  • pl.concat(), which is similar to the previous two, but with memory re-chunking. pl.concat() also allows diagonal concatenation, if columns are not shared.
  • df.extend() for vertical concatenation, but with memory re-chunking. Similar to df.vstack().rechunk().

For more on the differences between these methods, see here.

join()

df = pl.DataFrame({
  "integer": [1, 2, 3], 
  "date": [
    (datetime(2022, 1, 1)), 
    (datetime(2022, 1, 2)), 
    (datetime(2022, 1, 3))], 
    "float": [4.0, 5.0, 6.0],
    "string": ["a", "b", "c"]})


df2 = pl.DataFrame({
  "integer": [1, 2, 3], 
  "date": [
    (datetime(2022, 1, 4)), 
    (datetime(2022, 1, 5)), 
    (datetime(2022, 1, 6))], 
    "float":[7.0, 8.0, 9.0],
    "string": ["d", "d", "d"]})


df.join(
  df2, 
  on="integer", 
  how="left",
  validate='m:1'
  )
shape: (3, 7)
integer date float string date_right float_right string_right
i64 datetime[μs] f64 str datetime[μs] f64 str
1 2022-01-01 00:00:00 4.0 "a" 2022-01-04 00:00:00 7.0 "d"
2 2022-01-02 00:00:00 5.0 "b" 2022-01-05 00:00:00 8.0 "d"
3 2022-01-03 00:00:00 6.0 "c" 2022-01-06 00:00:00 9.0 "d"

Things to note:

  • Repeating column names have been suffixed with “_right”.
  • Recall, there are no indices. The on/left_on/right_on argument is always required.
  • how= may take the following values: ‘inner’, ‘left’, ‘outer’, ‘semi’, ‘anti’, ‘cross’. ‘inner’ is the default.
  • I like to add the validate= argument, for safety.
  • The join is super fast, as demonstrated in Section 2 above.

join_asof()

df.join_asof(
    df2, 
    left_on="date", 
    right_on='date', 
    by="integer", 
    strategy="backward", 
    tolerance='1w',
    )
shape: (3, 6)
integer date float string float_right string_right
i64 datetime[μs] f64 str f64 str
1 2022-01-01 00:00:00 4.0 "a" null null
2 2022-01-02 00:00:00 5.0 "b" null null
3 2022-01-03 00:00:00 6.0 "c" null null

Things to note:

  • Yes! join_asof() is available. In the streaming engine as well!
  • The strategy= argument may take the following values: ‘backward’, ‘forward’.
  • The tolerance= argument may take the following values: ‘1w’, ‘1d’, ‘1h’, ‘1m’, ‘1s’, ‘1ms’, ‘1us’, ‘1ns’.

hstack

new_column = pl.Series("c", np.repeat(1, df.height))

df.hstack([new_column])
shape: (3, 5)
integer date float string c
i64 datetime[μs] f64 str i64
1 2022-01-01 00:00:00 4.0 "a" 1
2 2022-01-02 00:00:00 5.0 "b" 1
3 2022-01-03 00:00:00 6.0 "c" 1

vstack

df.vstack(df2)
shape: (6, 4)
integer date float string
i64 datetime[μs] f64 str
1 2022-01-01 00:00:00 4.0 "a"
2 2022-01-02 00:00:00 5.0 "b"
3 2022-01-03 00:00:00 6.0 "c"
1 2022-01-04 00:00:00 7.0 "d"
2 2022-01-05 00:00:00 8.0 "d"
3 2022-01-06 00:00:00 9.0 "d"

Caution: stacking along rows is possible only if matched columns have the same dtype. Timestamps may be tricky, because they may have different time units. Recall that time units may be cast before stacking, using series.dt.cast_time_unit(). Here is a demonstration of the problem:

(
  df
  .vstack(
    df2
    .with_columns(
      pl.col('date').dt.cast_time_unit(time_unit="ms")
      )
    )
)

merge_sorted

This is a vertical stacking, when expecting a sorted result, and assuming inputs are sorted.

df.merge_sorted(df2, key="integer") 
shape: (6, 4)
integer date float string
i64 datetime[μs] f64 str
1 2022-01-01 00:00:00 4.0 "a"
1 2022-01-04 00:00:00 7.0 "d"
2 2022-01-02 00:00:00 5.0 "b"
2 2022-01-05 00:00:00 8.0 "d"
3 2022-01-03 00:00:00 6.0 "c"
3 2022-01-06 00:00:00 9.0 "d"

concat()

Vertical

pl.concat([df, df2]) 
# equivalent to:
# pl.concat([df, df2], how='vertical', rechunk=True, parallel=True) 
shape: (6, 4)
integer date float string
i64 datetime[μs] f64 str
1 2022-01-01 00:00:00 4.0 "a"
2 2022-01-02 00:00:00 5.0 "b"
3 2022-01-03 00:00:00 6.0 "c"
1 2022-01-04 00:00:00 7.0 "d"
2 2022-01-05 00:00:00 8.0 "d"
3 2022-01-06 00:00:00 9.0 "d"

Horizontal

pl.concat(
  [df,new_column.to_frame()], 
  how='horizontal',
  )
shape: (3, 5)
integer date float string c
i64 datetime[μs] f64 str i64
1 2022-01-01 00:00:00 4.0 "a" 1
2 2022-01-02 00:00:00 5.0 "b" 1
3 2022-01-03 00:00:00 6.0 "c" 1

Diagonal

pl.concat(
  [df,new_column.to_frame()], 
  how='diagonal')
shape: (6, 5)
integer date float string c
i64 datetime[μs] f64 str i64
1 2022-01-01 00:00:00 4.0 "a" null
2 2022-01-02 00:00:00 5.0 "b" null
3 2022-01-03 00:00:00 6.0 "c" null
null null null null 1
null null null null 1
null null null null 1

What is the difference between pl.concat(), df.vstack(), and df.hstack()? pl.concat() is more general, with more functionality.

  • concat includes re-chunking, which is useful for memory management.
  • concat includes diagonal concatenation, which is useful when columns are not shared.
  • concat includes parallel execution, which is useful for performance.
  • concat has recently been equipped with how='vertical_relaxed' and how='horizontal_relaxed', which find a common dtype if schemas are mismatched.

extend()

Like vstack(), but with memory re-chunking. Similar to df.vstack().rechunk().

df.extend(df2) 
shape: (6, 4)
integer date float string
i64 datetime[μs] f64 str
1 2022-01-01 00:00:00 4.0 "a"
2 2022-01-02 00:00:00 5.0 "b"
3 2022-01-03 00:00:00 6.0 "c"
1 2022-01-04 00:00:00 7.0 "d"
2 2022-01-05 00:00:00 8.0 "d"
3 2022-01-06 00:00:00 9.0 "d"

Why should I care about re-chunking? Since Polars is a columnar store, it is important to keep a contiguous memory layout. Otherwise, you may not enjoy the benefits of vectorized operations.

Is there a method for df.hstack().rechunk()? No. Columnar storage is not sensitive to memory fragmentation along columns.

Nested dtypes

Recall the nested dtypes:

  1. Polars Struct: Like a Python dict within a cell; Multiple named elements.
  2. Polars List: Similar to a Python list within a cell; Multiple unnamed elements. Unlike the Python list, in a Polars list, all elements must have the same dtype.
  3. Polars Array: Like a Polars list, but with a fixed length for all cells in the column.

Polars Struct

Why would I want to use a Polars Struct?

  1. To call functions that expect a dict argument constructed from multiple columns. E.g. a Ufunc called using pl.col().pipe() or pl.col().map_elements(). See here.
  2. To use column methods on multiple columns: hashing, checking uniqueness, etc.
  3. As the output of an operation that returns multiple columns. E.g. value_counts() within a group_by() context.
  4. When information cannot be structured into tabular form.

See here for more information on Polars Structs.

Make a Polars Struct

df.with_columns(
  pl.struct(
    pl.col("integer"),
    pl.col("float"),
    pl.col("string"),
    )
  .alias("struct")
  )
shape: (6, 5)
integer date float string struct
i64 datetime[μs] f64 str struct[3]
1 2022-01-01 00:00:00 4.0 "a" {1,4.0,"a"}
2 2022-01-02 00:00:00 5.0 "b" {2,5.0,"b"}
3 2022-01-03 00:00:00 6.0 "c" {3,6.0,"c"}
1 2022-01-04 00:00:00 7.0 "d" {1,7.0,"d"}
2 2022-01-05 00:00:00 8.0 "d" {2,8.0,"d"}
3 2022-01-06 00:00:00 9.0 "d" {3,9.0,"d"}

Or more compactly:

df.with_columns(
  pl.struct('integer', 'float', 'string').alias('struct')
  )
shape: (6, 5)
integer date float string struct
i64 datetime[μs] f64 str struct[3]
1 2022-01-01 00:00:00 4.0 "a" {1,4.0,"a"}
2 2022-01-02 00:00:00 5.0 "b" {2,5.0,"b"}
3 2022-01-03 00:00:00 6.0 "c" {3,6.0,"c"}
1 2022-01-04 00:00:00 7.0 "d" {1,7.0,"d"}
2 2022-01-05 00:00:00 8.0 "d" {2,8.0,"d"}
3 2022-01-06 00:00:00 9.0 "d" {3,9.0,"d"}

Now add field names:

df_with_struct = df.select(
  pl.struct(
    integer=pl.col("integer"),
    float=pl.col("float"),
    string=pl.col("string"),
    )
  .alias("struct")
  )

df_with_struct
shape: (6, 1)
struct
struct[3]
{1,4.0,"a"}
{2,5.0,"b"}
{3,6.0,"c"}
{1,7.0,"d"}
{2,8.0,"d"}
{3,9.0,"d"}

Convert struct to string in JSON format

df_with_struct.select(
  pl.col("struct").struct.json_encode()
  )
shape: (6, 1)
struct
str
"{"integer":1,"float":4.0,"string":"a"}"
"{"integer":2,"float":5.0,"string":"b"}"
"{"integer":3,"float":6.0,"string":"c"}"
"{"integer":1,"float":7.0,"string":"d"}"
"{"integer":2,"float":8.0,"string":"d"}"
"{"integer":3,"float":9.0,"string":"d"}"

Get field names

df_with_struct["struct"].struct.fields
['integer', 'float', 'string']
Note

In the example above, I used a Series method. For reasons I do not understand, struct.fields currently has only a pl.Series version, and not a pl.Expr version.

Accessing fields

df_with_struct.select(
  pl.col("struct").struct.field("integer")
  )
shape: (6, 1)
integer
i64
1
2
3
1
2
3

Struct to columns with unnest()

df_with_struct.unnest(columns=["struct"])
shape: (6, 3)
integer float string
i64 f64 str
1 4.0 "a"
2 5.0 "b"
3 6.0 "c"
1 7.0 "d"
2 8.0 "d"
3 9.0 "d"

Polars List

When would I want to use a Polars List?

  1. When I want to hold variable-length data in a single cell. E.g. a list of tags, a list of items in a shopping cart, etc.

See here for more information on Polars Lists.

Make a Polars Series with strings

pl.Series(["Alice, Bob, Charlie", "David, Eve, Frank", "George, Helen, Ida, Jack, Kate"])

df_with_long_strings = pl.DataFrame(
  {"employees": [
    "Alice, Bob, Charlie", 
    "David, Eve, Frank", 
    "George, Helen, Ida, Jack, Kate",
    "Liam, Mary, Noah, Olivia, Paul, Queen, Rose, Sam, Tom, Uma, Victor, Wendy, Xavier, Yara, Zane",
    "Abe, Ben, Cal, Dan, Ed, Fred, Gus, Hal, Ike, Joe, Ken, Lou, Max, Ned, Ollie, Pat, Quin, Ray, Sam, Tom, Ulf, Vic, Will, Xan, Yul, Zed",
    "Tim, Kim, Jim, Sim, Dim, Lim, Mim, Rim, Vim",
  ]}
)

pl.Config.set_fmt_str_lengths(200)
df_with_long_strings
shape: (6, 1)
employees
str
"Alice, Bob, Charlie"
"David, Eve, Frank"
"George, Helen, Ida, Jack, Kate"
"Liam, Mary, Noah, Olivia, Paul, Queen, Rose, Sam, Tom, Uma, Victor, Wendy, Xavier, Yara, Zane"
"Abe, Ben, Cal, Dan, Ed, Fred, Gus, Hal, Ike, Joe, Ken, Lou, Max, Ned, Ollie, Pat, Quin, Ray, Sam, Tom, Ulf, Vic, Will, Xan, Yul, Zed"
"Tim, Kim, Jim, Sim, Dim, Lim, Mim, Rim, Vim"

Break strings into list

df_with_list = df_with_long_strings.with_columns(
  pl.col("employees").str.split(", ").alias("employees")
  )

pl.Config.set_fmt_table_cell_list_len(100)
df_with_list
shape: (6, 1)
employees
list[str]
["Alice", "Bob", "Charlie"]
["David", "Eve", "Frank"]
["George", "Helen", "Ida", "Jack", "Kate"]
["Liam", "Mary", "Noah", "Olivia", "Paul", "Queen", "Rose", "Sam", "Tom", "Uma", "Victor", "Wendy", "Xavier", "Yara", "Zane"]
["Abe", "Ben", "Cal", "Dan", "Ed", "Fred", "Gus", "Hal", "Ike", "Joe", "Ken", "Lou", "Max", "Ned", "Ollie", "Pat", "Quin", "Ray", "Sam", "Tom", "Ulf", "Vic", "Will", "Xan", "Yul", "Zed"]
["Tim", "Kim", "Jim", "Sim", "Dim", "Lim", "Mim", "Rim", "Vim"]

Start Operating on List Elements

df_with_list.select(
  pl.col("employees").list.len().alias("n_employees"),
  pl.col("employees").list.first().alias("first_employee"),
  # pl.col("employees").list.get(0).alias("first_employee"), will also work
  pl.col("employees").list.last().alias("last_employee"),
  # pl.col("employees").list.get(-1).alias("last_employee"), will also work
  pl.col("employees").list.slice(offset=1, length=2).alias("second_and_third_employees"),
  pl.col("employees").list.contains("Alice").alias("contains_Alice")
)
shape: (6, 5)
n_employees first_employee last_employee second_and_third_employees contains_Alice
u32 str str list[str] bool
3 "Alice" "Charlie" ["Bob", "Charlie"] true
3 "David" "Frank" ["Eve", "Frank"] false
5 "George" "Kate" ["Helen", "Ida"] false
15 "Liam" "Zane" ["Mary", "Noah"] false
26 "Abe" "Zed" ["Ben", "Cal"] false
9 "Tim" "Vim" ["Kim", "Jim"] false

Things to note:

  1. Use .list. to access list methods.
  2. The full list of list methods is here.

explode()

Polars List columns may be “exploded”, i.e. broken into rows of a single list element each.

df_with_list.explode("employees").shape
(61, 1)

List-to-Struct and Struct-to List

Sometimes you need to convert a list to a struct. For instance, if you want to unnest() a list, you will need to convert it to a struct first. Or you may simply want named access to list elements.

List to Struct

(
  df_with_list
  .select(
    pl.col('employees').list.to_struct(
      n_field_strategy='max_width',
    )
  )
  .unnest('employees')
)
shape: (6, 26)
field_0 field_1 field_2 field_3 field_4 field_5 field_6 field_7 field_8 field_9 field_10 field_11 field_12 field_13 field_14 field_15 field_16 field_17 field_18 field_19 field_20 field_21 field_22 field_23 field_24 field_25
str str str str str str str str str str str str str str str str str str str str str str str str str str
"Alice" "Bob" "Charlie" null null null null null null null null null null null null null null null null null null null null null null null
"David" "Eve" "Frank" null null null null null null null null null null null null null null null null null null null null null null null
"George" "Helen" "Ida" "Jack" "Kate" null null null null null null null null null null null null null null null null null null null null null
"Liam" "Mary" "Noah" "Olivia" "Paul" "Queen" "Rose" "Sam" "Tom" "Uma" "Victor" "Wendy" "Xavier" "Yara" "Zane" null null null null null null null null null null null
"Abe" "Ben" "Cal" "Dan" "Ed" "Fred" "Gus" "Hal" "Ike" "Joe" "Ken" "Lou" "Max" "Ned" "Ollie" "Pat" "Quin" "Ray" "Sam" "Tom" "Ulf" "Vic" "Will" "Xan" "Yul" "Zed"
"Tim" "Kim" "Jim" "Sim" "Dim" "Lim" "Mim" "Rim" "Vim" null null null null null null null null null null null null null null null null null

When converting a list to a struct there are two things to consider: the number of fields and their names. The number of fields is governed by the n_field_strategy argument; the names of the fields are governed by the fields argument.

Polars Array

As of March 2024, Arrays are quite new to Polars. Currently, they behave mostly like lists, but with a fixed length. One can expect that the fixed-length constraint will allow for more efficient memory allocation, and a simpler API than the more general Polars Lists.

See here for more information on Polars Arrays.

Reshaping

High level:

  • df.transpose() as the name suggests.
  • df.melt() for wide to long.
  • df.pivot() for long to wide.
  • df.unnest() for breaking structs into columns.
  • df.unstack() for a fast (but less general) long-to-wide reshape.

transpose()

Recall

df
shape: (6, 4)
integer date float string
i64 datetime[μs] f64 str
1 2022-01-01 00:00:00 4.0 "a"
2 2022-01-02 00:00:00 5.0 "b"
3 2022-01-03 00:00:00 6.0 "c"
1 2022-01-04 00:00:00 7.0 "d"
2 2022-01-05 00:00:00 8.0 "d"
3 2022-01-06 00:00:00 9.0 "d"

Since there are no indices, if you want to keep the column names as a new column, you need to use the include_header=True argument.

df.transpose(include_header=True)
shape: (4, 7)
column column_0 column_1 column_2 column_3 column_4 column_5
str str str str str str str
"integer" "1" "2" "3" "1" "2" "3"
"date" "2022-01-01 00:00:00.000000" "2022-01-02 00:00:00.000000" "2022-01-03 00:00:00.000000" "2022-01-04 00:00:00.000000" "2022-01-05 00:00:00.000000" "2022-01-06 00:00:00.000000"
"float" "4.0" "5.0" "6.0" "7.0" "8.0" "9.0"
"string" "a" "b" "c" "d" "d" "d"

Wide to Long- melt()

Some data in wide format

# The following example is adapted from Pandas documentation: https://Pandas.pydata.org/docs/reference/api/Pandas.wide_to_long.html

np.random.seed(123)
wide = pl.DataFrame({
    'famid': ["11", "12", "13"],
    'birth': [1, 2, 3],
    'ht1': [2.8, 2.9, 2.2],
    'ht2': [3.4, 3.8, 2.9]})

wide
shape: (3, 4)
famid birth ht1 ht2
str i64 f64 f64
"11" 1 2.8 3.4
"12" 2 2.9 3.8
"13" 3 2.2 2.9

Reshape into long format

wide.melt(
  id_vars=['famid', 'birth'], 
  value_vars=['ht1', 'ht2'], 
  variable_name='treatment', 
  value_name='height')
shape: (6, 4)
famid birth treatment height
str i64 str f64
"11" 1 "ht1" 2.8
"12" 2 "ht1" 2.9
"13" 3 "ht1" 2.2
"11" 1 "ht2" 3.4
"12" 2 "ht2" 3.8
"13" 3 "ht2" 2.9

Things to note:

  • id_vars are the columns that index the wide format; these will be repeated.
  • value_vars are the columns that hold information in the wide format; these will be stacked in the long format.
  • variable_name is the column name, in the long format, that holds the column names from the wide format. The indexing columns in the long format will thus be id_vars+variable_name.
  • value_name is the column name, in the long format, that holds the values after stacking.

Long to Wide- pivot()

# Example adapted from https://stackoverflow.com/questions/5890584/how-to-reshape-data-from-long-to-wide-format

long = pl.DataFrame({
    'id': [1, 1, 2, 2, 3, 3],
    'treatment': ['A', 'B', 'A', 'B', 'A', 'B'],
    'height': [2.8, 2.9, 1.9, 2.2, 2.3, 2.1]
    })
  
long
shape: (6, 3)
id treatment height
i64 str f64
1 "A" 2.8
1 "B" 2.9
2 "A" 1.9
2 "B" 2.2
3 "A" 2.3
3 "B" 2.1

Pivot Without Aggregation

long.pivot(
  index='id', # index in the wide format
  columns='treatment', # defines columns in the wide format
  values='height',
  )
shape: (3, 3)
id A B
i64 f64 f64
1 2.8 2.9
2 1.9 2.2
3 2.3 2.1

If each combination of index= and columns= maps to more than a single value of values=, you can use the aggregate_function= argument of .pivot().

pivot() VS value_counts()

Here are two ways to count the number of unique values in a column, by group.

df_to_count = (
  pl.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "value": [1, 2, 3, 1, 2, 3],
    })
)

How to get the distribution of value by group?

The .pivot() method:

df_to_count.pivot(
  index="group",
  columns="value",
  values="value",
  aggregate_function="len",
  )
shape: (2, 4)
group 1 2 3
str u32 u32 u32
"A" 1 1 1
"B" 1 1 1

The .value_counts() method:

counts_with_value_counts = (
  df_to_count
  .group_by("group")
  .agg(
    pl.col("value").value_counts().alias("n_unique_values")
  )
)

counts_with_value_counts
shape: (2, 2)
group n_unique_values
str list[struct[2]]
"B" [{1,1}, {2,1}, {3,1}]
"A" [{2,1}, {3,1}, {1,1}]

Things to note:

  • The .pivot() method returns a column per unique value.
  • value_counts() returns a Polars List of Polars Structs per group. This may be a pro or a con, depending on your planned usage and the number of unique values in each group. To see this, consider a case where each group has 1000 values, non-overlapping with the other groups; how many columns will the .pivot() method return?

Here is my best attempt to extract the frequencies of each value. If you have a better solution, please let me know.

(
  counts_with_value_counts
  .with_columns(
    pl.col('n_unique_values').list.to_struct().struct.json_encode()
  )
)
shape: (2, 2)
group n_unique_values
str str
"B" "{"field_0":{"value":1,"count":1},"field_1":{"value":2,"count":1},"field_2":{"value":3,"count":1}}"
"A" "{"field_0":{"value":2,"count":1},"field_1":{"value":3,"count":1},"field_2":{"value":1,"count":1}}"

Long to Wide- unstack()

Another way to pivot from long to wide, which is less general but more efficient, is the unstack() method. Unstacking is much “dumber” than pivoting, and thus, much faster.

(
  long
  .sort('treatment')
  .unstack(
    step = 3,
    how='vertical',
    columns="height",
    )
)
shape: (3, 2)
height_0 height_1
f64 f64
2.8 2.9
1.9 2.2
2.3 2.1

unnest()

See Section 5.13.1.

Dataframe in Memory

df.estimated_size(unit="mb")
0.0002288818359375
df.rechunk() # ensure contiguous memory layout
shape: (6, 4)
integer date float string
i64 datetime[μs] f64 str
1 2022-01-01 00:00:00 4.0 "a"
2 2022-01-02 00:00:00 5.0 "b"
3 2022-01-03 00:00:00 6.0 "c"
1 2022-01-04 00:00:00 7.0 "d"
2 2022-01-05 00:00:00 8.0 "d"
3 2022-01-06 00:00:00 9.0 "d"
df.n_chunks() # number of ChunkedArrays in the dataframe
1
df.shrink_to_fit() # reduce memory allocation to actual size
shape: (6, 4)
integer date float string
i64 datetime[μs] f64 str
1 2022-01-01 00:00:00 4.0 "a"
2 2022-01-02 00:00:00 5.0 "b"
3 2022-01-03 00:00:00 6.0 "c"
1 2022-01-04 00:00:00 7.0 "d"
2 2022-01-05 00:00:00 8.0 "d"
3 2022-01-06 00:00:00 9.0 "d"

I/O

Reading Frames from Disk

tl;dr

If you want an eager frame use pl.read_xxx(), and if you want a lazy frame use pl.scan_xxx().

In Detail

To read data from disk we need to think about the following:

  1. What is the target object?
  2. What is the structure of the data on disk?
    1. Single file VS multiple files? If multiple files, how are they organized?
    2. Format of each file.
    3. Are the files remote or local?

If the target object is an eager frame you will want to use pl.read_xxx(). So to read a csv you will use pl.read_csv(), to read a parquet you will use pl.read_parquet(), etc.

If the target object is a lazy frame you will want to use pl.scan_xxx(). So to read a csv you will use pl.scan_csv(), to read a parquet you will use pl.scan_parquet(), etc.

If the data is stored in multiple files, scan_xxx() methods currently have better support. You can state the locations of the files in many ways. Note, however, that if you are used to Pandas, which takes a path to a folder and imports, recursively, all the files in the folder, Polars will not do that; it expects a more precise glob.

For example, say the data is in

./
└── 2022/
    ├── 01/
    │   ├── 01.parquet
    │   └── 02.parquet
    └── 02/
        ├── 01.parquet
        └── 02.parquet

In Pandas you could use pd.read_parquet("data/"). This will not work in Polars (TTBOMK), where you will need the glob pl.scan_parquet("data/**/*.parquet").

Multiple Parquet files also make an Apache Arrow Dataset. Polars will allow you to read an Arrow Dataset with pl.scan_pyarrow_dataset(). The syntax was used in Section 2.7.2 and looks like:

dset = ds.dataset("data/", format="parquet") 
pl.scan_pyarrow_dataset(dset)

When using this functionality you should recall the following:

  1. PyArrow has more functionality than Polars. As such, it may be able to read more formats, and more complex formats, where pl.scan_xxx() may fail.
  2. Polars' native readers are more optimized than PyArrow. In particular, they push down more operations, which is particularly important when reading from remote storage (i.e. data lakes) such as S3 or Azure Blob Storage.

Why Parquet?

Why do I prefer Parquet over CSV, Pickle, Feather, or other formats?

I prefer Parquet over CSV because Parquet is compressed and stores a schema, thus avoiding the time and errors involved in guessing the schema at import time.

I prefer Parquet over Pickle because Parquet is a standard, and is not specific to a particular Python version. Also, being optimized for columnar storage, Parquet has better compression and read performance.

I prefer Parquet over Feather because Parquet has better compression.

Writing Frames to Disk

tl;dr

An eager Polars frame will have a .write_xxx() method.

If the output of an operation does not fit into memory, so that an eager frame cannot materialize, look into the .sink_parquet() engine. It writes to disk in a streaming fashion. Alas, it currently supports a limited set of operations.

Databases

See here for the user guide.

Note you can (currently) only read eager frames from databases.

Polars Expressions

Now that you are familiar with column transformations in Polars’ contexts, it is time to tell you: all those transformations are Polars expressions.

Things to recall about Polars expressions:

  1. Think of them as functions of other Polars expressions. As such, expressions can be chained.
  2. An expression, or chain thereof, materializes when a .collect() method is called.
  3. Almost all pl.Series() methods are available as Polars expressions, and vice-versa. There are, however, exceptions.

Because almost all Polars Series methods are available as Polars expressions, we refer the reader to Section 4 for a review of important Series methods. In this section we will focus on exceptions, and some important expressions that are not available for Series.

Polars LazyFrames

Recall: a LazyFrame is a Polars DataFrame that has not been materialized, i.e., it is nothing but a plan to import some file from disk and operate on it. It will only materialize when a .collect() method is called.

In case I did not repeat it enough: LazyFrames allow you to process data that does not fit in memory. This is a tremendous difference from Pandas or Numpy, where the data is memory resident. Imagine processing 100GB of data on your MacBook Air! Or your EC2 instance!

A full list of operations that are available for LazyFrames is available here.

LazyFrame Is Not a Single Computing Engine

As previously mentioned, Polars has multiple evaluation engines. The eager frames' engine is the first. The evaluation of LazyFrames can be done by multiple engines. They depend on:

  1. In/out of RAM: In memory or streaming processing?
  2. Reader: Polars native or PyArrow file readers?
  3. Output: Output to memory or disk?
In/out of RAM | Reader  | Output | Command
------------- | ------- | ------ | -------
RAM           | Native  | RAM    | pl.scan_parquet()...collect()
Stream        | Native  | RAM    | pl.scan_parquet()...collect(streaming=True)
RAM           | PyArrow | RAM    | pl.scan_pyarrow_dataset().collect()
Stream        | PyArrow | RAM    | pl.scan_pyarrow_dataset().collect(streaming=True)
Stream        | Native  | Disk   | pl.scan_csv()...sink_parquet()

Things worth knowing:

  1. The authors of Polars make a considerable effort so that the transition between engines is seamless to the user. This effort is usually successful, but exceptions exist. An easy way to debug your code is to first try a different engine. In my experience, the RAM/Native/RAM engine is the most complete and robust.
  2. The Stream/Native/Disk engine is very limited. Currently, it is mostly useful for converting CSV to Parquet.
  3. The PyArrow readers are the most general and stable way to import into Arrow format. Not being native to Polars, they may be less efficient.

Creating A Lazy Frame

A LazyFrame is created in two ways:

  1. By using a pl.scan_xxx() method to read data from disk.
  2. By the .lazy() method of an eager frame.

Here is a pl.scan_xxx() example

path_to_file = 'data/NYC/yellow_tripdata_2023-01.parquet'

f"{os.path.getsize(path_to_file)/1e6:.0f} MB on disk"
'48 MB on disk'
taxi_lazy = pl.scan_parquet(path_to_file)

taxi_lazy.limit(5).collect()
shape: (5, 19)
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge airport_fee
i64 datetime[ns] datetime[ns] f64 f64 f64 str i64 i64 i64 f64 f64 f64 f64 f64 f64 f64 f64 f64
2 2023-01-01 00:32:10 2023-01-01 00:40:36 1.0 0.97 1.0 "N" 161 141 2 9.3 1.0 0.5 0.0 0.0 1.0 14.3 2.5 0.0
2 2023-01-01 00:55:08 2023-01-01 01:01:27 1.0 1.1 1.0 "N" 43 237 1 7.9 1.0 0.5 4.0 0.0 1.0 16.9 2.5 0.0
2 2023-01-01 00:25:04 2023-01-01 00:37:49 1.0 2.51 1.0 "N" 48 238 1 14.9 1.0 0.5 15.0 0.0 1.0 34.9 2.5 0.0
1 2023-01-01 00:03:48 2023-01-01 00:13:25 0.0 1.9 1.0 "N" 138 7 1 12.1 7.25 0.5 0.0 0.0 1.0 20.85 0.0 1.25
2 2023-01-01 00:10:29 2023-01-01 00:21:19 1.0 1.43 1.0 "N" 107 79 1 11.4 1.0 0.5 3.28 0.0 1.0 19.68 2.5 0.0

Operations on LazyFrames

Almost everything we did with eager frames, we can do with lazy frames.

If an operation fails, it may be that it has not yet been implemented for the engine you are using. You will find that error messages are less informative for the streaming engine than for the eager engine. Try changing the engine as a first step of debugging.

Here is a random example

(
  taxi_lazy # calling a lazy frame
  
  # do some operations....
  .select(
    pl.col(pl.Float64).mean().name.suffix("_mean"),
  )
  .collect() # don't forget to collect
  
  # Handle the printing
  .transpose(include_header=True)
  .to_pandas() # for nice printing
)
column column_0
0 passenger_count_mean 1.362532
1 trip_distance_mean 3.847342
2 RatecodeID_mean 1.497440
3 fare_amount_mean 18.367069
4 extra_mean 1.537842
5 mta_tax_mean 0.488290
6 tip_amount_mean 3.367941
7 tolls_amount_mean 0.518491
8 improvement_surcharge_mean 0.982085
9 total_amount_mean 27.020383
10 congestion_surcharge_mean 2.274231
11 airport_fee_mean 0.107409

Lazy Frame Attributes

Some information is available from the metadata of a LazyFrame. This information is available without materializing the frame (at least with Parquet files). Notice the absence of the collect() call in the following examples.

taxi_lazy.schema
OrderedDict([('VendorID', Int64),
             ('tpep_pickup_datetime',
              Datetime(time_unit='ns', time_zone=None)),
             ('tpep_dropoff_datetime',
              Datetime(time_unit='ns', time_zone=None)),
             ('passenger_count', Float64),
             ('trip_distance', Float64),
             ('RatecodeID', Float64),
             ('store_and_fwd_flag', String),
             ('PULocationID', Int64),
             ('DOLocationID', Int64),
             ('payment_type', Int64),
             ('fare_amount', Float64),
             ('extra', Float64),
             ('mta_tax', Float64),
             ('tip_amount', Float64),
             ('tolls_amount', Float64),
             ('improvement_surcharge', Float64),
             ('total_amount', Float64),
             ('congestion_surcharge', Float64),
             ('airport_fee', Float64)])
taxi_lazy.columns
['VendorID',
 'tpep_pickup_datetime',
 'tpep_dropoff_datetime',
 'passenger_count',
 'trip_distance',
 'RatecodeID',
 'store_and_fwd_flag',
 'PULocationID',
 'DOLocationID',
 'payment_type',
 'fare_amount',
 'extra',
 'mta_tax',
 'tip_amount',
 'tolls_amount',
 'improvement_surcharge',
 'total_amount',
 'congestion_surcharge',
 'airport_fee']
taxi_lazy.dtypes
[Int64,
 Datetime(time_unit='ns', time_zone=None),
 Datetime(time_unit='ns', time_zone=None),
 Float64,
 Float64,
 Float64,
 String,
 Int64,
 Int64,
 Int64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64,
 Float64]

Useful Tricks to Make The Most of LazyFrames

  1. Use Parquet or Feather. CSV and Pickle should be avoided.

  2. Store your data partitioned, along your frequent group_by’s and filters.

  3. Materialize as late as possible.

  4. Read about the options of the native readers. In particular:

  • Try the low_memory=True option of the native readers, before trying the streaming engine.
  • When reading multiple files with a glob pattern, set rechunk=True (the default is False), if your memory allows it.

Processing Multiple Frames Simultaneously

Q: What if you want to access a column from frame df, when processing frame df2?
A: Just join them.
Q: What if they are not joinable?
A: Use a diagonal join.
Q: Can’t I just add a search-space into the lazy query?
A: Ahhh! Use df.with_context().

df3 = pl.Series("blah", [100,2,3]).to_frame()

q = (
    df.lazy()
    .with_context( # add columns of df3 to the search space
        df3.lazy()
        )
    .with_columns(
        pl.col('float').map_dict(remapping={4.0:None}, default=100).fill_null(pl.col('blah').mean()).alias('float2'),
        )
    )

q.collect()
DeprecationWarning: `map_dict` is deprecated. It has been renamed to `replace`. The default behavior has changed to keep any values not present in the mapping unchanged. Pass `default=None` to keep existing behavior.

DeprecationWarning: the argument `remapping` for `map_dict` is deprecated. It has been renamed to `mapping`.
shape: (6, 5)
integer date float string float2
i64 datetime[μs] f64 str f64
1 2022-01-01 00:00:00 4.0 "a" 35.0
2 2022-01-02 00:00:00 5.0 "b" 100.0
3 2022-01-03 00:00:00 6.0 "c" 100.0
1 2022-01-04 00:00:00 7.0 "d" 100.0
2 2022-01-05 00:00:00 8.0 "d" 100.0
3 2022-01-06 00:00:00 9.0 "d" 100.0

Things to note:

  • with_context() is a lazy operation. This is great news, since it means both frames will benefit from query planning, etc.
  • with_context() will not copy the data, but rather, add a reference to the data.
  • Why not use pl.col('blah').mean() within the map_dict()? That is indeed more reasonable. It simply did not work.
  • Try it yourself: Can you use multiple with_context()?

Imputation Example

train_lf = pl.LazyFrame(
    {"feature_0": [-1.0, 0, 1], "feature_1": [-1.0, 0, 1]}
)
test_lf = pl.LazyFrame(
    {"feature_0": [-1.0, None, 1], "feature_1": [-1.0, 0, 1]}
)

(
  test_lf
  .with_context(
    train_lf
    .select(pl.all().suffix("_train")))
    .select(
      pl.col("feature_0")
      .fill_null(
        pl.col("feature_0_train").median()
      )
  )
).collect()
DeprecationWarning: `suffix` is deprecated. It has been moved to `name.suffix`.
shape: (3, 1)
feature_0
f64
-1.0
0.0
1.0

Misc

SQL

res = (
  pl.SQLContext(frame=df)
  .execute(
    "SELECT * FROM frame WHERE integer > 2"
  )
)
res.collect()
shape: (2, 4)
integer date float string
i64 datetime[μs] f64 str
3 2022-01-03 00:00:00 6.0 "c"
3 2022-01-06 00:00:00 9.0 "d"

Things to note:

  • The name of the frame in the query is the name it was registered under (frame), not the name of the Python object (df).
  • I suspect that the SQL context is not as optimized as the native Polars operations. I have very little experience with it.

For more on the SQL context see here.

Plotting

The Plotting Backend

Requires the hvplot module.

df.plot()
df.plot.hist()

Things to note:

  • The default .plot() will behave like pd.df.plot(), i.e., return a line plot of all columns.
  • df.plot. will give you access to the usual plotting methods (of hvplot), such as hist(), scatter(), etc.

As Input to Plotting Libraries

My preferred plotting library is plotly, not hvplot. Since plotly is (currently) not a plotting backend for Polars, I will use the frame as input to plotly functions.

px.line(
  df, 
  x="integer", 
  y="float",
  markers=True,
  )

Things to note:

  • Recent versions of Plotly can deal with a Polars frame like any Pandas frame. In particular, they can extract columns by name, and use the names as axis titles.

Tables

Polars can print tables in ASCII, or write in HTML to ipython notebooks.

print(df)
shape: (6, 4)
┌─────────┬─────────────────────┬───────┬────────┐
│ integer ┆ date                ┆ float ┆ string │
│ ---     ┆ ---                 ┆ ---   ┆ ---    │
│ i64     ┆ datetime[μs]        ┆ f64   ┆ str    │
╞═════════╪═════════════════════╪═══════╪════════╡
│ 1       ┆ 2022-01-01 00:00:00 ┆ 4.0   ┆ a      │
│ 2       ┆ 2022-01-02 00:00:00 ┆ 5.0   ┆ b      │
│ 3       ┆ 2022-01-03 00:00:00 ┆ 6.0   ┆ c      │
│ 1       ┆ 2022-01-04 00:00:00 ┆ 7.0   ┆ d      │
│ 2       ┆ 2022-01-05 00:00:00 ┆ 8.0   ┆ d      │
│ 3       ┆ 2022-01-06 00:00:00 ┆ 9.0   ┆ d      │
└─────────┴─────────────────────┴───────┴────────┘
df
shape: (6, 4)
integer date float string
i64 datetime[μs] f64 str
1 2022-01-01 00:00:00 4.0 "a"
2 2022-01-02 00:00:00 5.0 "b"
3 2022-01-03 00:00:00 6.0 "c"
1 2022-01-04 00:00:00 7.0 "d"
2 2022-01-05 00:00:00 8.0 "d"
3 2022-01-06 00:00:00 9.0 "d"

If, like myself, you are not satisfied with these options, you can gain more control over the printing of tables using the following:

  1. Exporting as Pandas dataframe.
  2. Posit’s GreatTables module.

Export as Pandas Dataframe

(
  df
  .to_pandas()
  .style
    .format({
      'float': '{:.1f}',
      'date': '{:%Y-%m-%d}'
      })
    .background_gradient(
      cmap='Reds', 
      axis=0,
      )
)
  integer date float string
0 1 2022-01-01 4.0 a
1 2 2022-01-02 5.0 b
2 3 2022-01-03 6.0 c
3 1 2022-01-04 7.0 d
4 2 2022-01-05 8.0 d
5 3 2022-01-06 9.0 d

Great Tables

import great_tables as gt

(
  gt.GT(
    df,
    # rowname_col="integer",
    )
  .tab_header(
    title="Nonsense Data", 
    subtitle="But looking good!")
  .fmt_number(columns="float", compact=True)
  .fmt_date(columns="date", date_style="wd_m_day_year")
  .tab_stubhead(label="integer")
  .data_color(
    columns=["float",'integer'],
    # domain=[1, 6],
    # palette=["rebeccapurple", "white", "orange"],
    # na_color="white",
)
  # .fmt_currency(columns=["open", "high", "low", "close"])
  .cols_hide(columns="string")

)
Nonsense Data
But looking good!
integer date float
1 Sat, Jan 1, 2022 4.00
2 Sun, Jan 2, 2022 5.00
3 Mon, Jan 3, 2022 6.00
1 Tue, Jan 4, 2022 7.00
2 Wed, Jan 5, 2022 8.00
3 Thu, Jan 6, 2022 9.00

ML

When doing ML with Polars frames there are two possibilities:

  1. Your learning function can ingest Polars frames. This is currently the exception.
  2. You will need to convert your Polars frame to a Numpy array, or a PyTorch tensor, or a Pandas frame, etc. This is currently the rule.

Example

import sklearn as sk
from sklearn.linear_model import LinearRegression

# predict dummies of `string` with `float`
X = df.select(["float"])
y = df["string"].to_dummies()

model = LinearRegression()
model.fit(X.to_numpy(), y.to_numpy())
LinearRegression()

Can sklearn ingest Polars frames directly? Yes.

model.fit(X, y)
LinearRegression()

This is because sklearn takes array-like objects, i.e., anything that np.asarray() can convert to a numpy array. This also means that with sklearn, your learning is memory resident.

np.asarray(X)
array([[4.],
       [5.],
       [6.],
       [7.],
       [8.],
       [9.]])

Patsy

Patsy is a Python library for describing statistical models and building design matrices using R’s tilde syntax (y~X).

Patsy can already be used with Polars frames.

import patsy as pt

y, X = pt.dmatrices("float ~ integer", df)

model = LinearRegression()
model.fit(X, y)
LinearRegression()

Polars-ds

Polars-ds is a Polars extension designed to give Polars some more Data-Science functionality. It is currently in development and worth following.

Writing Your Own Extensions

You can extend Polars in various ways. If you are familiar with Rust, you can write a plugin. If you just want your own functions to be accessible as methods, see here.